Abstract

XML is rapidly emerging as the new standard for data representation and exchange on the Web. Unlike HTML, tags in XML 
documents describe the semantics of the data and not how it is to be displayed. In addition, an XML document can be accompanied 
by a Document Type Descriptor (DTD) which plays the role of a schema for an XML data collection. DTDs contain valuable 
information on the structure of documents and thus have a crucial role in the efficient storage of XML data, as well as the effective 
formulation and optimization of XML queries. Despite their importance, however, DTDs are not mandatory, and it is frequently 
possible that documents in XML databases will not have accompanying DTDs. In this paper, we propose XTRACT, a novel 
system for inferring a DTD schema for a database of XML documents. Since the DTD syntax incorporates the full expressive 
power of regular expressions, naive approaches typically fail to produce concise and intuitive DTDs. Instead, the XTRACT 
inference algorithms employ a sequence of sophisticated steps that involve: (1) finding patterns in the input sequences and 
replacing them with regular expressions to generate "general" candidate DTDs, (2) factoring candidate DTDs using adaptations 
of algorithms from the logic optimization literature, and (3) applying the Minimum Description Length (MDL) principle to 
find the best DTD among the candidates. The results of our experiments with real-life and synthetic DTDs demonstrate the 
effectiveness of XTRACT's approach in inferring concise and semantically meaningful DTD schemas for XML databases. 

1 Introduction 

Motivation and Background. The genesis of the Extensible Markup Language (XML) was based on the thesis that structured
documents can be freely exchanged and manipulated, if published in a standard, open format. Indeed, as a corroboration of the 
thesis, XML today promises to enable a suite of next-generation Web applications ranging from intelligent web searching to 
electronic commerce. 

In many respects, XML data is an instance of semistructured data [Abi97]. XML documents comprise hierarchically nested 
collections of elements, where each element can be either atomic (i.e., raw character data) or composite (i.e., a sequence of
nested subelements). Further, tags stored with elements in an XML document describe the semantics of the data rather than
simply specifying how the element is to be displayed (as in HTML). Thus, XML data, like semistructured data, is hierarchically
structured and self-describing. 

A characteristic, however, that distinguishes XML from semistructured data models is the notion of a Document Type
Descriptor (DTD) that may optionally accompany an XML document. A document's DTD serves the role of a schema specifying
the internal structure of the document. Essentially, a DTD specifies, for every element, the regular expression pattern that
subelement sequences of the element need to conform to. DTDs are critical to realizing the promise of XML as the data representation
format that enables free interchange of electronic data (EDI) and integration of related news, products, and services information
from disparate data sources. This is because, in the absence of DTDs, tagged documents have little meaning. However, once
the major software vendors and corporations agree on domain-specific standards for DTD formats, it would become possible for
inter-operating applications to extract, interpret, and analyze the contents of a document based on the DTD that it conforms to.

In addition to enabling the free exchange of electronic documents through industry-wide standards, DTDs also provide the
basic mechanism for defining the structure of the underlying XML data. As a consequence, DTDs play a crucial role in the
efficient storage of XML data as well as the formulation, optimization, and processing of queries over a collection of XML
documents. For instance, in [SHT+99], DTD information is exploited to generate effective relational schemas, which are subsequently
employed to efficiently store and query entire XML documents in a relational database. In [DFS99], frequently occurring
portions of XML documents are stored in a relational system, while the remainder is stored in an overflow graph; once again, 
the DTD is exploited to simplify overflow mappings. Similarly, DTDs can be used to devise efficient plans for queries and thus 
speed up query evaluation in XML databases by restricting the search to only relevant portions of the data (see, for example, 
[GW97, FS97]). The basic idea is to use the knowledge of the structure of the data captured by the DTD to prune elements that 
cannot potentially satisfy the path expression in the query. Finally, by shedding light on how the underlying data is structured, 
DTDs aid users in forming meaningful queries over the XML database. 

Despite their importance, however, DTDs are not mandatory and an XML document may not always have an accompanying 
DTD. In fact, several recent papers (e.g., [GMW99, Wid91]) claim that it is frequently possible that only specific portions of XML
databases will have associated DTDs, while the overall database is still "schema-less". This may be the case, for instance, when
large volumes of XML documents are automatically generated from data stored in relational databases, flat files (e.g., HTML
pages, bibliography files), or other semistructured data repositories. Since very little data is in XML format today, it is very likely
that, at least initially, the majority of XML documents will be automatically generated from pre-existing data sources by a new
generation of software tools. In most cases, such automatically-created document collections will not have an accompanying
DTD. 

Therefore, based on the above discussion on the virtues of a DTD, it is important to devise algorithms and tools that can infer 
an accurate, meaningful DTD for a given collection of XML documents (i.e., instances of the DTD). This is not an easy task. 
Since the DTD syntax incorporates the full specification power of regular expressions, manually deducing such a DTD schema 
for even a small set of XML documents created by a user could prove to be a process of daunting complexity. Furthermore, as 
we show in this paper, naive approaches fail to deliver meaningful and intuitive DTD descriptions of the underlying data. Both 
problems are, of course, exacerbated for large XML document collections. In light of the several benefits of DTDs, we can 
motivate a myriad of potential applications for efficient, automated DTD discovery tools. For example, users or domain experts 
looking for a meaningful description of their XML data can use the DTD description returned by such tools as a starting point 
from which more refined schemas can be generated. As another application, consider an employment web site that integrates 
information on job openings from thousands of different web sites including company home pages, newspaper classified sites, 
and so on. These XML documents, although related, may not all have the same structure and, even if some of the documents 
are accompanied by DTDs, the DTDs may not be identical. An alternative to manually transforming all the XML documents 
to conform to a single format would be to simply store the documents in their original formats and use DTD discovery tools 
to derive a single DTD description for the entire database. This inferred DTD can then help in the formulation, optimization, 
and processing of queries over the database of stored XML documents. Finally, the ability to extract DTDs for a range of XML 
formats supported by the major participants in a specific industrial setting can also aid in the DTD standardization process for the 
industry. 

Our Contributions. In this paper, we describe the architecture of XTRACT, a novel system for inferring an accurate, meaning- 
ful DTD schema for a repository of XML documents. A naive and straightforward solution to our DTD extraction problem would 
be to infer as the DTD for an element, a "concise" expression which describes exactly all the sequences of subelements nested 
within the element in the entire document collection. As we demonstrate in Section 3, however, the DTDs generated by this ap- 
proach tend to be voluminous and unintuitive (especially for large XML document collections). In fact, we discover that accurate 
and meaningful DTD schemas that are also intuitive and appealing to humans (i.e., resemble what a human expert is likely to 
come up with) tend to generalize. That is, "good" DTDs are typically regular expressions describing subelement sequences that 
may not actually occur in the input XML documents. (Note that this, in fact, is always the case for DTD regular expressions
that correspond to infinite regular languages, e.g., DTDs containing one or more Kleene stars [HU79].) In practice, however,
there are numerous such candidate DTDs that generalize the subelement sequences in the input, and choosing the DTD that best 
describes the structure of these sequences is a non-trivial task. In the inference algorithms employed in the XTRACT system, 
we propose the following novel combination of sophisticated techniques to generate DTD schemas that effectively capture the 
structure of the input sequences. 

• Generalization. As a first step, the XTRACT system employs novel heuristic algorithms for finding patterns in each 
input sequence and replacing them with appropriate regular expressions to produce more general candidate DTDs. The 
main goal of the generalization step is to judiciously introduce metacharacters (like Kleene stars "*") to produce regular 
subexpressions that generalize the patterns observed in the input sequences. Our generalization heuristics are based on the 
discovery of frequent, neighboring occurrences of subsequences and symbols within each input sequence. In their effort 
to introduce a sufficient amount of generalization while avoiding an explosion in the number of resulting patterns, our 
techniques are inspired by practical, real-life DTD examples. 

• Factoring. As a second step, the XTRACT system factors common subexpressions from the generalized candidate DTDs 
obtained from the generalization step, in order to make them more concise. The factoring algorithms applied are appropriate 
adaptations of techniques from the logic optimization literature [BM82, Wan89].

• Minimum Description Length (MDL) Principle. In the final and most important step, the XTRACT system employs 
Rissanen's Minimum Description Length (MDL) principle [Ris78, Ris89] to derive an elegant mechanism for composing 
a near-optimal DTD schema from the set of candidate DTDs generated by the earlier two steps. (Our MDL-based notion 
of optimality will be defined formally later in the paper.) The MDL principle has its roots in information theory and, 
essentially, provides a principled, scientific definition of the optimal "theory/model" that can be inferred from a set of data
examples [QR89b]. Abstractly, in our problem setting, MDL ranks each candidate DTD depending on the number of bits
required to describe the input collection of sequences in terms of the DTD (DTDs requiring fewer bits are ranked higher). 
As a consequence, the optimal DTD according to the MDL principle is the one that is general enough to cover a large 
subset of the input sequences but, at the same time, captures the structure of the input sequences with a fair amount of 
detail, so that they can be described easily (with few additional bits) using the DTD. Thus, the MDL principle provides 
a formal notion of "best DTD" that exactly matches our intuition. Using MDL essentially allows XTRACT to control 
the amount of generalization introduced in the inferred DTD in a principled, scientific and, at the same time, intuitively 
appealing fashion. 

We demonstrate that selecting the optimal DTD based on the MDL principle has a direct and natural mapping to the facility
location problem (FLP), which is known to be NP-complete [Hoc82]. Fortunately, efficient approximation algorithms with 
guaranteed performance ratios have been proposed for the FLP in the literature [CG99], thus allowing us to efficiently 
compose the final DTD in a near-optimal manner. 

We have implemented our XTRACT DTD derivation algorithms and conducted an extensive experimental study with both 
real-life and synthetic DTDs. Our findings show that, for a set of random inputs that conform to a predetermined DTD, XTRACT 
always produces a DTD that is either identical or very close to the original DTD. We also observe that the quality of the DTDs 
returned by XTRACT is far superior compared to those output by the IBM alphaworks' DDbE (Data Descriptors by Example) 
DTD extraction tool, which is unable to identify a majority of the DTDs. Further, a number of the original DTDs correctly 
inferred by XTRACT contain several regular expression terms, some nested within one another. Thus, our experimental results
clearly demonstrate the effectiveness of XTRACT's methodology for deducing fairly complex DTDs. 

Several extensions to DTDs, e.g., Document Content Descriptors (DCDs) and XML Schemas, are being evolved by the Web
community. These extensions aim to add typing information since DTDs treat all data as strings. Therefore, XTRACT can be
used with little or no changes for inferring DCDs and XML Schemas in conjunction with other mechanisms for inferring the
types. However, these proposals are still evolving and none of them have stabilized - therefore, we do not concentrate on these 
extensions in this paper. 

Roadmap. The remainder of the paper is organized as follows. After discussing related work in Section 2, we present an 
overview of our approach to inferring DTDs in Section 3. Section 4 describes how the MDL principle is employed within 
XTRACT to compose a "good" DTD from an input set of candidate DTDs. In Sections 5 and 6, we present generalization and 
factoring algorithms for producing candidate DTDs that are input to the MDL module of XTRACT. Section 7 discusses the results 
of our experiments with real-life and synthetic DTDs. Finally, we offer concluding remarks in Section 8. 

2 Related Work 

The problem of mining DTDs from a collection of XML documents, to the best of our knowledge, is novel and has not been 
previously addressed in the literature. A few DTD extraction software tools can be found on the Web (e.g., the IBM alphaworks 
DDbE product) - however, it has been our experience that these tools are somewhat naive in their approach and the quality of the 
DTDs inferred by them is poor (see Section 7). 

The problem of extracting a schema from semistructured data has been addressed in [NAM98, GW97, FS97]. Although XML
can be viewed as an instance of semistructured data, the kinds of schema considered in [NAM98, GW97, FS97] are very different
from a DTD. The schemas extracted by [NAM98, GW97, FS97] attempt to find a typing for semistructured data. Assuming a
graph-based model for semistructured data (nodes denote objects and labels on edges denote relationships between them), finding
a typing is tantamount to grouping objects that have similarly labeled edges to and from similarly typed objects. The typing then
describes this grouping in terms of the labels of the edges to (from) this type of objects and the types of the objects at the other
end of the edge. In contrast, one can perhaps view the DTD as having already grouped all objects based on their incoming edges
(tag of the element) into the same type and then describing the possible sequence of outgoing edges (subelements) as a regular 
expression. It is the fact that the outgoing edges from a type can be described by an arbitrary regular expression that distinguishes 
DTDs from the schemas in semistructured databases. Since the schemas in semistructured databases are expressed using plain 
sequences or sets of edges, they cannot be used to infer DTDs corresponding to arbitrary regular expressions. 

Inference of formal languages from examples has a long and rich history in the field of computational learning theory, and 
more related to our work is the extensive study of the inference of DFAs (deterministic finite automata) [Gol67, Gol78, Ang78] 
(see also [Pit89] for a detailed survey of the topic). The above line of work is purely theoretical and it focuses on investigating the
computational complexity of the language inference problem, while we are mainly interested in devising practical algorithms for 
real world applications. In this sense, our research is more closely related to the work in [Bra93] which addressed the problem 
of approximating roughly equivalent regular expressions from a long enough string, and the work in [KMU95] where the MDL 
principle was used to infer a pattern language from positive examples. However, the problem tackled in [KMU95] is much
simpler than ours since they assume that the set of simple patterns whose subset is to be computed is available. Furthermore, the 
patterns they consider are simple sequences that are permitted to contain single symbol wildcards. In our problem setting, unlike 
[KMU95], patterns are general regular expressions and are not known a priori.

3 Problem Formulation and Overview of our Approach 

In this section, we present a precise definition of the problem of inferring a DTD from a collection of XML documents and then 
present an overview of the steps performed by the XTRACT system. But first, we present a brief overview of XML and DTDs in 
the following subsection to make the subsequent discussion concrete. 

<article>
  <title> A Relational Model for Large Shared Data Banks </title>
  <author>
    <name> E. F. Codd </name>
    <affiliation> IBM Research </affiliation>
  </author>
</article>



Figure 1: An Example XML Document 

<!ELEMENT article (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, affiliation)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT affiliation (#PCDATA)>



Figure 2: An Example DTD 

3.1 Overview of XML and DTDs 

An XML document, like an HTML document, consists of nested element structures starting with a root element. Subelements
of an element can either be elements or simply character data. Figure 1 illustrates an example XML document, in which the
root element (article) has two nested subelements (title and author), and the author element in turn has two nested
subelements. The title element contains character data denoting the title of the article while the name element contains the
name of the author of the article. The ordering of subelements within an element is significant in XML. Elements can also
[...] specification can [...] the structure of an XML document. A DTD constrains the structure of an element by
specifying a regular expression that its subelement sequences have to conform to. Figure 2 illustrates a DTD that the XML
document in Figure 1 conforms to. The DTD declaration syntax uses commas for sequencing, | for (exclusive) or, parentheses for
grouping, and the metacharacters ?, * and + to denote, respectively, zero or one, zero or more, and one or more occurrences of the
preceding term. As a special case, the DTD corresponding to an element can be ANY, which allows an arbitrary XML fragment
to be nested within the element. The DTD can also be used to specify the attributes for an element (using the <!ATTLIST>
declaration) and to declare an attribute that refers to another element (via an IDREF field). We must point out that real-life
DTDs can contain regular expression terms with multiple levels of nesting (e.g., ((ab)*c)*). We present examples of real-life
DTDs in Sections 5 and 7.

In the remainder of the paper, we denote each element by a single letter from the lower case
alphabet. Also, we do not include explicit commas in element sequences and regular expressions since they can be inferred in a
straightforward fashion.
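
Under the single-letter convention, a content model written without commas is already a valid pattern for most regular expression libraries. The following is a minimal sketch in Python, not part of XTRACT: the helper name conforms is hypothetical, and it assumes the content model uses only the metacharacters listed above.

```python
import re

def conforms(dtd: str, sequence: str) -> bool:
    """Return True if the subelement sequence matches the DTD content model.

    Assumes single-letter element names and no explicit commas, so the
    content model can be handed to `re` unchanged (spaces are dropped).
    """
    pattern = dtd.replace(" ", "")
    return re.fullmatch(pattern, sequence) is not None

# Figure 2 with t = title and a = author: article is (title, author*), i.e. ta*.
print(conforms("ta*", "ta"))           # the document of Figure 1
print(conforms("ta*", "at"))           # subelement order is significant
print(conforms("((ab)*c)*", "abcc"))   # a content model with nested terms
```

The whole sequence must match (fullmatch), since a DTD constrains the complete list of subelements rather than a prefix of it.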



3.2 Problem Definition 



Our primary focus in this paper is to infer a DTD for a collection of XML documents. Thus, for each element that appears in the
XML documents, our goal is to derive a regular expression that subelement sequences for the element (in the XML documents)
conform to. Note that an element's DTD is completely independent of the DTD for other elements, and only restricts the sequence
of subelements nested within the element. Therefore, for simplicity of exposition, in the rest of the paper, we concentrate on the
problem of extracting a DTD for a single element. In this paper, we do not address the problem of computing attribute lists for
an element - since these are simple lists, their computation is not particularly challenging. 

Let e be an element that appears in the XML documents for which we want to infer the DTD. It is straightforward to compute 
the sequence of subelements nested within each <e></e> pair in the XML documents. Let I denote the set of N such
sequences, one sequence for every occurrence of element e in the data. The problem we address in this paper can be stated as
follows.

Problem Statement. Given a set I of N input sequences nested within element e, compute a DTD for e such that every sequence
in I conforms to the DTD. □
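
Collecting the input set for an element can be sketched as follows. This is an illustration only: it assumes Python's standard xml.etree.ElementTree parser, and the function name subelement_sequences and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def subelement_sequences(documents, element):
    """Collect one child-tag sequence per occurrence of `element`."""
    sequences = []
    for text in documents:
        root = ET.fromstring(text)
        # Element.iter() visits the root itself and all of its descendants.
        for node in root.iter(element):
            sequences.append([child.tag for child in node])
    return sequences

docs = [
    "<article><title>T1</title><author><name>N</name>"
    "<affiliation>A</affiliation></author></article>",
    "<article><title>T2</title></article>",
]
print(subelement_sequences(docs, "article"))
print(subelement_sequences(docs, "author"))
```

Each occurrence of the element contributes one sequence, so the result is a list rather than a set and repeated sequences are kept.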

As stated, an obvious solution to the problem is to find the most "concise" regular expression R whose language is I. One
mechanism to find such a regular expression is to factor, as much as possible, the expression corresponding to the or of sequences
in I. Factoring a regular expression makes it "concise" without changing the language of the expression. For example, ab | ac can
be factored into a(b | c). An alternate method for computing the most concise regular expression is to first find the automaton with
the smallest number of states that accepts I and then derive the regular expression from the automaton (note that the obtained
regular expression, however, may not be the shortest regular expression for I). In any case, such a concise regular expression
whose language is I is unfortunately not a "good" DTD, in the sense that it tends to be voluminous and unintuitive. We illustrate
this using the DTD of Figure 2. Suppose we have a collection of XML documents that conform to this DTD. Abbreviating the
title tag by t, and the author tag by a, it is reasonable to expect the following sequences to be the subelement sequences of
the article element in the collection of XML documents: t, ta, taa, taaa, taaaa. Clearly, the most concise regular expression
for the above language is t|t(a|a(a|a(a|aa))), which is definitely much more voluminous and a lot less intuitive than a DTD such
as ta*.
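
The difference between the two expressions can be checked by brute force. The sketch below is an illustration under stated assumptions: the helper language and the length bound of 7 are choices made here, and Python's re module stands in for a DTD matcher.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len matched by `pattern`."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(chars))
    }

I = {"t", "ta", "taa", "taaa", "taaaa"}
exact = "t|t(a|a(a|a(a|aa)))"   # factored "or" of the input sequences
general = "ta*"                 # the generalizing DTD

print(language(exact, "ta", 7) == I)            # the factored form accepts exactly I
print(I <= language(general, "ta", 7))          # ta* covers I
print(sorted(language(general, "ta", 7) - I))   # and also accepts longer sequences
```

Within the bound, the factored expression accepts nothing outside I, whereas ta* additionally accepts the sequences with five and six authors.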

In other words, the obvious solution above never "generalizes" and would therefore never contain metacharacters like * 
in the inferred DTD. Clearly, a human being would at most times want to use such metacharacters in a DTD to succinctly 
convey the constraints he/she wishes to impose on the structure of XML documents. Thus, the challenge is to infer for the 
set of input sequences I, a "general" DTD which is similar to what a human would come up with. However, as the following
example illustrates, there can be several possible "generalizations" for a given set of input sequences and thus we need to devise 
a mechanism for choosing the one that best describes the sequences. 

Example 3.1 Consider I = {ab, abab, ababab}. A number of DTDs match the sequences in I: (1) (a | b)*, (2) ab | abab | ababab,
(3) (ab)*, (4) ab | ab(ab | abab), and so on. DTD (1) is similar to ANY in that it allows any arbitrary sequence of as and bs,
while DTD (2) is simply an or of all the sequences in I. DTD (4) is derived from DTD (2) by factoring the subsequence ab from
the last two disjuncts of DTD (2). The problem with DTD (1) is that it represents a gross over-generalization of the input, and
the inferred DTD completely fails to capture any structure inherent in the input. On the other hand, DTDs (2) and (4) accurately
reflect the structure of the input sequences but do not generalize or learn any meaningful patterns which make the DTDs smaller
or simpler to understand. Thus, none of the DTDs (1), (2) or (4) seem "good". However, of the above DTDs, (3) has great
intuitive appeal since it is succinct and it generalizes the input sequences without losing too much information about the structure
of the input sequences. □ 
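
A rough way to see the tension between the four candidates is to tabulate their size against the number of sequences they admit beyond I. The sketch below is only an illustration: the character count as a size measure, the length bound of 6, and the helper extra_strings are assumptions made here, not measures defined in the paper.

```python
import re
from itertools import product

I = ["ab", "abab", "ababab"]
dtds = {1: "(a|b)*", 2: "ab|abab|ababab", 3: "(ab)*", 4: "ab|ab(ab|abab)"}

def extra_strings(pattern, max_len=6):
    """Count strings over {a, b} up to max_len that match but are not in I."""
    count = 0
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if s not in I and re.fullmatch(pattern, s):
                count += 1
    return count

for number, pattern in dtds.items():
    covers = all(re.fullmatch(pattern, s) for s in I)
    print(number, pattern, "covers I:", covers,
          "size:", len(pattern), "extra:", extra_strings(pattern))
```

Up to length 6, DTD (1) admits 124 sequences outside I, DTD (3) admits one (the empty sequence), and DTDs (2) and (4) admit none but are almost three times as long as DTD (3).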

Based on the discussion in the above example, we can characterize the set of desirable DTDs by placing the following two 
qualitative restrictions on the inferred DTD. 

R1: The DTD should be concise (i.e., small in size).

R2: The DTD should be precise (i.e., not cover too many sequences not contained in I).

Restriction R1 above ensures that the inferred DTD is easy to understand and succinct, thus eliminating, in many cases, concise
regular expressions whose language is I. Restriction R2, on the other hand, attempts to ensure that the DTD is not too general and
captures the structure of input sequences, thus eliminating a DTD such as ANY. While the above restrictions seem reasonable at
an intuitive level, there is a problem with devising a solution based on the above restrictions. The problem is that restrictions R1
and R2 conflict with each other. In our earlier example, restriction R1 would favor DTDs (1) and (3), while these DTDs would
not be considered good according to criterion R2. The situation is exactly the reverse when we consider DTDs (2) and (4). Thus,
in general, there is a tradeoff between a DTD's "conciseness" and its "preciseness", and a good DTD is one that strikes the right
balance between the two. The problem here is that conciseness and preciseness are qualitative notions - in order to resolve the
tradeoff between the two, we need to devise quantitative measures for mathematically capturing the two qualitative notions.

3.3 Using the MDL Principle to Define a Good DTD 

We use the MDL principle [Ris78, Ris89] to define an information-theoretic measure for quantifying and thereby resolving the
tradeoff between the conciseness and preciseness properties of DTDs. The MDL principle has been successfully applied in the
past in a variety of situations ranging from constructing good decision tree classifiers [QR89a, MRA95] to learning common
patterns in sets of strings [KMU95].

Roughly speaking, the MDL principle states that the best theory to infer from a set of data is the one which minimizes the sum of

(A) the length of the theory, in bits, and

(B) the length of the data, in bits, when encoded with the help of the theory.

We will refer to the above sum, for a theory, as the MDL cost for the theory. The MDL principle is a general one and needs to be
instantiated appropriately for each situation. In our setting, the theory is the DTD and the data is the sequences in I. Thus, the
MDL principle assigns each DTD an MDL cost and ranks the DTDs based on their MDL costs (DTDs with lower MDL costs are
ranked higher). Furthermore, parts (A) and (B) of the MDL cost for a DTD depend directly on its conciseness and preciseness,
respectively. Part (A) is the number of bits required to describe the DTD and is thus a direct measure of its conciseness. Further,
since a DTD that is more precise captures the structure of the input sequences more accurately, fewer bits are required to describe
the sequences in I in terms of a more precise DTD. As a result, Part (B) of the MDL cost captures a DTD's preciseness. The
MDL cost for a DTD thus provides us with an elegant and principled mechanism (rooted in information theory) for quantifying
(and combining) the conflicting concepts of conciseness and preciseness in a single unified framework, and in a manner that is
consistent with our intuition. By favoring concise and precise DTDs, and penalizing those that are not, it ranks highly exactly
those DTDs that would be deemed desirable by humans.

Note that the actual encoding scheme used to specify a DTD as well as the data (with the help of the DTD) plays a critical role
in determining the actual values for the two components of the MDL cost. We defer the details of the actual encoding scheme
to Section 4. However, in the following example, we employ a simple encoding scheme (a coarser version of the scheme in
Section 4) to illustrate how ranking DTDs based on their MDL cost closely matches our intuition of their goodness.

Example 3.2 Consider the input set I and DTDs from Example 3.1. We compute the MDL cost of each DTD which, as
mentioned earlier, is the cost of encoding the DTD itself and the sequences in I in terms of the DTD. We then rank the DTDs
based on their MDL costs (DTDs with smaller MDL costs are considered better). In our simple encoding scheme, we assume a
cost of 1 unit for each character.

DTD (1), (a | b)*, has a cost of 6 for encoding the DTD. In order to encode the sequence abab using the DTD, we need one
character to specify the number of repetitions of the term (a | b) that precedes the * (in this case, this number is 4) and
4 additional characters to specify which of a or b is chosen from each repetition. Thus, the total cost of encoding abab using
(a | b)* is 5 and the MDL cost of the DTD is 6 + 3 + 5 + 7 = 21. Similarly, the MDL cost of DTD (2) can be shown to be 14
(to encode the DTD) + 3 (to encode the input sequences; we need one character to specify the position of the disjunct for each
sequence) = 17. The cost of DTD (3) is 5 (to encode the DTD) + 3 (to encode the input sequences - note that we only need to
specify the number of repetitions of the term ab for each sequence) = 8. Finally, DTD (4) has a cost of 14 + 5 (1 character to
encode sequence ab and 2 characters for each of the other two input sequences) = 19.

Thus, since DTD (3) has the least MDL cost, it would be considered the best DTD by the MDL principle - which matches our
intuition. □

[Figure 3: Architecture of the XTRACT System, panels (a) and (b). Input sequences: I = {ab, abab, ac, ad, bc, bd, bbd, bbbbe}; MDL Module; inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e).]

From the above example, it follows that the MDL principle indeed provides an elegant mechanism for quantifying and resolv- 
ing the tradeoff between the conciseness and preciseness properties of DTDs. Specifically, 

1. Part (A) of the MDL cost includes the number of bits required to encode the DTD - this ensures that the inferred DTD is
succinct. 

2. Part (B) of the MDL cost includes the number of bits needed for encoding the input sequences using the DTD. Usually, 
expressing data in terms of a more general DTD (e.g., (a | b)* in Example 3.2) requires more bits than describing data in
terms of a more specific DTD (e.g., (ab)* in Example 3.2). As a result, using the MDL principle ensures that the DTD we
choose is a fairly tight characterization of the data. 

The MDL principle, thus, enables us to choose a DTD that strikes the right balance between conciseness and preciseness. 
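
The arithmetic of Example 3.2 can be reproduced in a few lines. This is a sketch of the coarse one-unit-per-character scheme of the example only, not the encoding of Section 4; the per-sequence cost functions are written by hand from the description in the example, and the names candidates and mdl_cost are hypothetical.

```python
I = ["ab", "abab", "ababab"]

# DTD text (without spaces) and the cost, in characters, of encoding one
# input sequence with the help of that DTD, as described in Example 3.2.
candidates = {
    1: ("(a|b)*", lambda s: 1 + len(s)),          # repetition count + one choice per symbol
    2: ("ab|abab|ababab", lambda s: 1),           # position of the disjunct
    3: ("(ab)*", lambda s: 1),                    # repetition count only
    4: ("ab|ab(ab|abab)", lambda s: 1 if s == "ab" else 2),  # outer (+ inner) disjunct
}

def mdl_cost(dtd, sequence_cost):
    theory = len(dtd)                             # part (A): length of the DTD
    data = sum(sequence_cost(s) for s in I)       # part (B): length of the data given the DTD
    return theory + data

costs = {number: mdl_cost(dtd, f) for number, (dtd, f) in candidates.items()}
print(costs)
print("best DTD:", min(costs, key=costs.get))
```

The resulting costs are 21, 17, 8 and 19 for DTDs (1) through (4), so DTD (3) is ranked first, in agreement with the example.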

3.4 Overview of the XTRACT System 

The architecture of the XTRACT system is illustrated in Figure 3(a). As shown in the figure, the system consists of three main 
components: the generalization module, the factoring module and the MDL module. Input sequences in I are processed by the
three subsystems one after another, the output of one subsystem serving as input to the next. We denote the outputs of the
generalization and factoring modules by Sg and Sf, respectively. Observe that both Sg and Sf contain the initial input sequences
in I. This is to ensure that the MDL module has a wide range of DTDs to choose from that includes the obvious DTD which is
simply an or of all the input sequences in I. In the following, we provide a brief description of each subsystem; we defer a more
detailed description of the algorithms employed by each subsystem to later sections. 



The Generalization Subsystem. For each input sequence, the generalization module generates zero or more candidate DTDs
that are derived by replacing patterns in the input sequence with regular expressions containing metacharacters like * and | (e.g.,
(ab)*, (a | b)*). Note that the initial input sequences do not contain metacharacters and so the candidate DTDs introduced by
the generalization module are more general. For instance, in Figure 3(a), sequences abab and bbbe result in the more general
candidate DTDs (ab)*, (a | b)* and b*e being output by the generalization subsystem. Also, observe that each candidate DTD
produced by the generalization module may cover only a subset of the input sequences. Thus, the final DTD output by the MDL
module may be an or of multiple candidate DTDs.

Ideally, in the generalization phase, we should consider all DTDs that cover one or more input sequences as candidates so
that the MDL step can choose the best among them. However, the number of such DTDs can be enormous. For example, the
sequence ababaabb is covered by the following DTDs, in addition to many more: (a | b)*, (a | b)*a*b*, (ab)*(a | b)*, (ab)*a*b*.
Therefore, in this paper, we outline several novel heuristics, inspired by real-life DTDs^, for limiting the set of candidate DTDs
Sg output by the generalization module.

The Factoring Subsystem. The factoring component factors two or more candidate DTDs in Sg into a new candidate DTD. The
length of the new DTD is smaller than the sum of the sizes of the DTDs factored. For example, in Figure 3(a), candidate DTDs
b*d and b*e, representing the expression b*d | b*e, when factored, result in the DTD b*(d | e); similarly, the candidates ac, ad, bc
and bd are factored into (a | b)(c | d) (the pre-factored expression is ac | ad | bc | bd). Although factoring leaves the semantics
of candidate DTDs unchanged, it is nevertheless an important step. This is because factoring reduces the size of the DTD,
and thus the cost of encoding the DTD, without seriously impacting the cost of encoding input sequences using the DTD. Thus,
since the DTD encoding cost is a component of the MDL cost for a DTD, factoring can result in certain DTDs being chosen by
the MDL module that may not have been considered earlier. We appropriately modify factoring algorithms for boolean functions
in the logic optimization area [BM82, Wan89] to meet our needs. However, even though every subset of candidate DTDs can, in
principle, be factored, the number of these subsets can be large and only a few of them result in good factorizations. We propose
novel heuristics to restrict our attention to subsets that can be factored effectively.


The MDL Subsystem. The MDL subsystem finally chooses, from among the set of candidate DTDs Sf generated by the previous
two subsystems, a set of DTDs that cover all the input sequences in I and the sum of whose MDL costs is minimum. The final
DTD is then an or of the DTDs in the set. For the input sequences in Figure 3(a), we illustrate (using solid lines) in Figure 3(b)
the input sequences (in the right column) covered by the candidate DTDs in Sf (in the left column).

The above cost minimization problem naturally maps to the Facility Location Problem (FLP) for which polynomial time
approximation algorithms have been proposed in the literature [Hoc82, CG99]. We adapt the algorithm from [CG99] for our
purposes, and using it, the XTRACT system is able to infer the DTD shown at the bottom of Figure 3(b).

4 The MDL Subsystem 

The MDL subsystem constitutes the core of the XTRACT system - it is responsible for choosing a set S of candidate DTDs from
Sf such that the final DTD D (which is an or of the DTDs in S) (1) covers all sequences in I, and (2) has the minimum MDL
cost. Consequently, we describe this module first, and postpone the presentation of the generalization and factoring modules to
Sections 5 and 6, respectively.

Recall that the MDL cost of a DTD that is used to explain a set of sequences comprises

(A) the length, in bits, needed to describe the DTD, and

(B) the length of the sequences, in bits, when encoded in terms of the DTD.

Thus, in the following subsection, we first present the encoding schemes for computing parts (A) and (B) of the MDL cost of
a DTD. Subsequently, in Section 4.2, we present the algorithm for computing the set S ⊆ Sf of candidate DTDs whose or
^The DTDs are available at Robin Cover's SGML/XML Web page (http://www.oasis-open.org/cover/).

(A) seq(D, s) = ε if D = s. In this case, the DTD D is a sequence of symbols from the alphabet Σ and does not contain any
metacharacters.

(B) seq(D1 ... Dk, s1 ... sk) = seq(D1, s1) ... seq(Dk, sk); that is, D is the concatenation of regular expressions D1 ... Dk
and the sequence s can be written as the concatenation of the subsequences s1 ... sk, such that each subsequence si matches
the corresponding regular expression Di.

(C) seq(D1 | ... | Dm, s) = i seq(Di, s); that is, D is the exclusive choice of regular expressions D1 ... Dm, and i is the index
of the regular expression that the sequence s matches. Note that we need ⌈log m⌉ bits to encode the index i.

(D)
\[ seq(D^*, s_1 \ldots s_k) = \begin{cases} k \; seq(D, s_1) \ldots seq(D, s_k) & \text{if } k > 0 \\ 0 & \text{otherwise} \end{cases} \]

In other words, the sequence s = s1 ... sk is produced from D* by instantiating the repetition operator k times, and each
subsequence si matches the i-th instantiation. In this case, since there is no simple and inexpensive way to bound a priori
the number of bits required for the index k, we first specify the number of bits required to encode k in unary (that is, a
sequence of ⌈log k⌉ 1s, followed by a 0) and then the index k using ⌈log k⌉ bits. The 0 in the middle serves as the delimiter
between the unary encoding of the length of the index and the actual index itself.

Figure 4: The Encoding Scheme

yields the final DTD D with the minimum MDL cost. Note that the candidate DTDs in Sf can be complex regular expressions
(containing *, | etc.) output by the generalization and factoring subsystems.

4.1 The encoding scheme 

We begin by describing the procedure for estimating the number of bits required to encode the DTD itself (part (A) of the MDL
cost). Let Σ be the set of subelement symbols that appear in sequences in I. Let M be the set of metacharacters |, *, +, ?, (, ).
Let the length of a DTD, viewed as a string in Σ ∪ M, be n. Then, the length of the DTD in bits is n log(|Σ| + |M|). As an
example, let Σ consist of the elements a and b. The length in bits of the DTD a*b* is 4 * log(2 + 6) = 12. Similarly, the length
in bits of the DTD (ab|abb)(aa|ab*) is 16 * 3 = 48.
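
The part (A) computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dtd_length_bits` is ours, every subelement symbol is a single character, and the logarithm is taken base 2 (which is what the arithmetic log(2 + 6) = 3 in the text implies).

```python
import math

# The six metacharacters listed in the text: | * + ? ( )
METACHARACTERS = set("|*+?()")

def dtd_length_bits(dtd: str, alphabet: set) -> float:
    """Part (A) of the MDL cost: n * log2(|alphabet| + |metacharacters|)."""
    # n is the length of the DTD viewed as a string (whitespace ignored).
    n = sum(1 for ch in dtd if not ch.isspace())
    return n * math.log2(len(alphabet) + len(METACHARACTERS))

print(dtd_length_bits("a*b*", {"a", "b"}))              # 12.0
print(dtd_length_bits("(ab|abb)(aa|ab*)", {"a", "b"}))  # 48.0
```

Both printed values agree with the two worked examples above.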

We next describe the scheme for encoding a sequence using a DTD (part (B) of the MDL cost). The encoding scheme
constructs a sequence of integral indices (which forms the encoding) for expressing a sequence in terms of a DTD. The following
simple examples illustrate the basic building blocks on which our encoding scheme for more complex DTDs is built:

1. The encoding for the sequence a in terms of the DTD a is the empty string ε.

2. The encoding for the sequence b in terms of the DTD a | b | c is the integral index 1 (denotes that b is at position 1, counting
from 0, in the above DTD).

3. The encoding for the sequence bbb in terms of the DTD b* is the integral index 3 (denotes 3 repetitions of b).

We now generalize the encoding scheme for arbitrary DTDs and arbitrary sequences. Let us denote the sequence of integral
indices for a sequence s when encoded in terms of a DTD D by seq(D, s). We define seq(D, s) recursively in terms of component
DTDs within D as shown in Figure 4. Thus, seq(D, s) can be computed using a recursive procedure based on the encoding
scheme of Figure 4. Note that we have not provided the definitions of the encodings for operators + and ? since these can be
defined in a similar fashion to * (for +, k is always greater than 0, while for ?, k can only assume values 1 or 0). We now illustrate
the encoding scheme using the following example.

Example. We show how steps (A), (B), (C) and (D) in Figure 4 are recursively applied to derive the encoding seq((ab|c)*(de|fg*), abccabfggg).

1. Apply Step (B). seq((ab|c)*, abccab) seq((de|fg*), fggg)

2. Apply Step (D). 4 seq(ab|c, ab) seq(ab|c, c) seq(ab|c, c) seq(ab|c, ab) seq((de|fg*), fggg)

3. Apply Step (C). 4 0 seq(ab, ab) 1 seq(c, c) 1 seq(c, c) 0 seq(ab, ab) 1 seq(fg*, fggg)

4. Apply Step (A). 4 0 1 1 0 1 seq(fg*, fggg)

5. Apply Steps (A), (B) and (D). 4 0 1 1 0 1 3

In order to derive the final bit sequence corresponding to the above indices, we need to include in the encoding the unary
representation for the number of bits required to encode the indices 4 and 3. Thus, we obtain the following bit encoding for the
sequence (we have inserted blanks in between the encoding for successive indices for clarity):

seq((ab|c)*(de|fg*), abccabfggg) = 1110100 0 1 1 0 1 11011



□ 
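
The recursion of Figure 4 can be sketched as follows. This is a minimal illustration, not the XTRACT implementation: the tuple-based DTD representation and all function names are assumptions made here, symbols are single characters, and matching is done by brute-force enumeration of decompositions rather than with an automaton. For repetition counts the sketch writes k in plain binary, preceded by that many 1s and a 0, which reproduces the bit strings 1110100 and 11011 of the example.

```python
from functools import lru_cache
from math import ceil, log2

# Hypothetical DTD representation (an assumption of this sketch):
#   "ab"                a plain symbol sequence          (step A)
#   ("cat", d1, d2...)  concatenation                    (step B)
#   ("or", d1, d2...)   exclusive choice, >= 2 options   (step C)
#   ("star", d)         zero or more repetitions of d    (step D)

def cost(tokens):
    return sum(len(t) for t in tokens)

def best(candidates):
    found = [c for c in candidates if c is not None]
    return min(found, key=cost, default=None)

def join(a, b):
    return None if a is None or b is None else a + b

def rep_bits(k):
    """Unary length prefix, a 0 delimiter, then k in binary."""
    if k == 0:
        return "0"
    binary = format(k, "b")
    return "1" * len(binary) + "0" + binary

@lru_cache(maxsize=None)
def splits(d, s):
    """Map k -> cheapest tokens for writing s as k non-empty pieces matching d."""
    if not s:
        return {0: ()}
    out = {}
    for i in range(1, len(s) + 1):
        first = encode(d, s[:i])
        if first is None:
            continue
        for k, rest in splits(d, s[i:]).items():
            cand = first + rest
            if k + 1 not in out or cost(cand) < cost(out[k + 1]):
                out[k + 1] = cand
    return out

@lru_cache(maxsize=None)
def encode(d, s):
    """Cheapest encoding of s in terms of d as a tuple of bit strings, or None."""
    if isinstance(d, str):                                   # step (A)
        return () if d == s else None
    op, *parts = d
    if op == "cat":                                          # step (B)
        if len(parts) == 1:
            return encode(parts[0], s)
        tail = ("cat", *parts[1:])
        return best(join(encode(parts[0], s[:i]), encode(tail, s[i:]))
                    for i in range(len(s) + 1))
    if op == "or":                                           # step (C)
        width = ceil(log2(len(parts)))
        return best(join((format(i, "b").zfill(width),), encode(p, s))
                    for i, p in enumerate(parts))
    return best((rep_bits(k),) + toks                        # step (D)
                for k, toks in splits(parts[0], s).items())

dtd = ("cat", ("star", ("or", "ab", "c")), ("or", "de", ("cat", "f", ("star", "g"))))
print(" ".join(encode(dtd, "abccabfggg")))  # 1110100 0 1 1 0 1 11011
```

When several decompositions of s match the DTD, `best` keeps the one with the fewest bits, and `encode` returns `None` when the DTD does not cover the sequence.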



In steps (B), (C) and (D), we need to be able to determine if a sequence s matches a DTD D. Since a DTD is a regular
expression, this can be done by constructing a non-deterministic finite automaton for D, which can also be used to decompose
the sequence s into subsequences such that each subsequence matches the corresponding sub-part of the DTD D, thus enabling
us to come up with the encoding. Note that there may be several ways of decomposing s such that each subsequence matches
the corresponding sub-part of D. In such cases, we can enumerate every decomposition of s that matches the sub-parts of D
and select, from among these decompositions, the one that results in the minimum length encoding of s in terms of D.

4.2 Computing the DTD with Minimum MDL Cost 

We now turn our attention to the problem of computing the final DTD D (which is an or of a subset S of candidate DTDs in Sf)
that covers all the input sequences in I and has the minimum MDL cost. This problem maps naturally to the Facility Location
Problem (FLP) [Hoc82, CG99]. The FLP is formulated as follows: Let C be a set of clients and J be a set of facilities such that
each facility "serves" every client. There is a cost c(j) of "choosing" a facility j ∈ J and a cost d(j, i) of serving client i ∈ C
by facility j ∈ J. The problem definition asks to choose a subset of facilities F ⊆ J such that the sum of the costs of the chosen
facilities plus the sum of the costs of serving every client by its closest chosen facility is minimized, that is,

\[ \min_{F \subseteq J} \Big\{ \sum_{j \in F} c(j) + \sum_{i \in C} \min_{j \in F} d(j, i) \Big\} \]

Our problem can be mapped to the FLP as follows: Let C be the set I of input sequences
and J be the set of candidate DTDs in Sf. The cost of choosing a facility is the length of the corresponding candidate DTD. The
cost of serving client i from facility j, d(j, i), is the length of the encoding of the sequence corresponding to i using the DTD
corresponding to the facility j. If a DTD j does not cover a sequence i, then we set d(j, i) to ∞. Thus, the set F computed by
the FLP corresponds to our desired set S of candidate DTDs.






The FLP is NP-hard; however, it can be reduced to the set cover problem and then approximated within a logarithmic factor
as shown in [Hoc82]. In our implementation, we employ the randomized algorithm from [CG99] which approximates the FLP
within a constant factor if the distance function is a metric. Even though our distance function is not a metric, we have found the
FLP approximations produced by [CG99] for our problem setting to be very good in practice. Furthermore, the time complexity
of [CG99] for computing the approximate solution is O(N^2 · log N), where N = |I|.
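
To make the reduction concrete, the following sketch evaluates the FLP objective and picks facilities with a plain greedy rule. It is not the randomized algorithm of [CG99] and carries no approximation guarantee; it is only meant to show how candidate DTDs play the role of facilities and input sequences the role of clients. The function names and the toy cost values are assumptions of this sketch.

```python
import math

def flp_cost(chosen, facility_cost, serve_cost, clients):
    """Return (number of uncovered clients, finite part of the FLP objective)."""
    uncovered, total = 0, sum(facility_cost[j] for j in chosen)
    for i in clients:
        # A missing (facility, client) pair means the DTD does not cover the sequence.
        d = min((serve_cost.get((j, i), math.inf) for j in chosen), default=math.inf)
        if math.isinf(d):
            uncovered += 1
        else:
            total += d
    return uncovered, total

def greedy_flp(facility_cost, serve_cost, clients):
    """Greedy stand-in: keep adding the facility that lowers the cost the most."""
    chosen = set()
    best = flp_cost(chosen, facility_cost, serve_cost, clients)
    while True:
        options = [(flp_cost(chosen | {j}, facility_cost, serve_cost, clients), j)
                   for j in facility_cost if j not in chosen]
        if not options:
            break
        cost, j = min(options)
        if cost >= best:      # no facility improves coverage or cost any further
            break
        chosen.add(j)
        best = cost
    return chosen, best

# Toy instance (illustrative numbers, in bits).
clients = ["ab", "abab", "ababab"]
facility_cost = {"ab": 6, "abab": 12, "ababab": 18, "(ab)*": 15}
serve_cost = {("ab", "ab"): 0, ("abab", "abab"): 0, ("ababab", "ababab"): 0,
              ("(ab)*", "ab"): 3, ("(ab)*", "abab"): 5, ("(ab)*", "ababab"): 5}
print(greedy_flp(facility_cost, serve_cost, clients))  # ({'(ab)*'}, (0, 28))
```

Costs are compared as pairs so that covering every sequence takes priority over saving bits, which plays the role of the infinite serving cost in the formulation above.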



5 The Generalization Subsystem 

The quality of the DTD computed by the MDL module is very dependent on the set of candidate DTDs Sf input to it. In case
Sf were to contain only input sequences in I, then the final DTD output by the MDL subsystem would simply be the or of all
the sequences in I. However, as we observed earlier, this is not a desirable DTD since it is neither concise nor intuitive. Thus, in
order to infer meaningful DTDs, it is crucial that the candidate DTDs in Sf be general - the goal of the generalization component
is to achieve this objective by inferring a set Sg of general DTDs which are then input to the factorization step. As we mentioned
before, the factorization step infers additional factored DTDs and generates Sf, which is a superset of Sg.

The generalization component in XTRACT infers a number of regular expressions which we have found to frequently appear 
in real-life DTDs. Below, we present examples of such regular expressions from real-life DTDs that appear in the Newspaper 
Association of America (NAA) Classified Advertizing Standards XML DTD^. 

a*bc*: DTDs of this form are generally used to specify tuples with set-valued attributes.

<!ELEMENT account-info (account-number, sub-account-number*)> <!--
Specification for account identification information -->

(abc)*: This type of DTD is used to represent a set (or a list) of ordered tuples.

<!ELEMENT days-and-hours (date, time)+> <!-- provide times/dates
when job fairs will be held -->

(a|b|c)*: The DTD of the form (a|b|c)* is frequently used to represent a multiset containing the elements a, b and c. This DTD is
very useful since the elements in the multiset are allowed to appear multiple times and in any order in the document. For
example, the following DTD specifies that the support information for an ad can consist of an arbitrary number of audio or
video clips and photos, and further, these can appear in any order.

<!ELEMENT support-info (audio-clip | file-id | graphic | logo |
new-list | photo | video-clip | zz-generic-tag)*> <!-- support
information for ad content -->

((ab)*c)*: This type of DTD permits nesting relationships among sets (or lists).

<!ELEMENT transfer-info (transfer-number, (from-to, company-id)+,
contact-info)*> <!-- provides parent information through the
multi-level aggregation process, may be repeated -->

Although our algorithms can infer regular expressions that are more complex than the above, we do not infer certain complex
expressions such as (ab?c*d?)* that are less likely to occur in practice. We defer further discussion of this topic to Section 7.

We now discuss our generalization algorithm which is outlined in Figure 5. Procedure GENERALIZE infers several DTDs for
each input sequence in I independently and adds them to the set Sg. Therefore, it may over-generalize in some cases (since we
are inferring DTDs based on a single sequence); however, our MDL step will ensure that such over-general DTDs are not
chosen as part of the final inferred DTD if there are better alternatives. Recall that the generalization step is merely trying to
provide several alternate candidates to the MDL step. In particular, Sg ⊇ I, and therefore, the DTD corresponding to the or of
the input sequences will be considered by the MDL step.

^These can be accessed at http://www.naa.org/technology/clsstdtf/Adex010.dtd.

The core of procedure GENERALIZE consists of the procedures DiscoverSeqPattern and DiscoverOrPattern, which are
repeatedly called with various parameter values. We discuss details of these procedures and the roles of the parameters next.

5.1 Discovering Sequencing Patterns 

Procedure DiscoverSeqPattern, shown in Figure 5, takes as input an input sequence s and returns a candidate DTD that is
derived from s by replacing sequencing patterns of the form xx...x, for a subsequence x in s, with the regular expression (x)*.
In addition to s, the procedure also accepts as input a threshold parameter r > 1 which is the minimum number of contiguous
repetitions of subsequence x in s required for the repetitions to be replaced with (x)*. In case there are multiple subsequences x
with the maximum number of repetitions in Step 2, the longest among them is chosen, and subsequent ties are resolved arbitrarily.

Note that instead of introducing the regular expression term (x)* into the sequence s, we choose to introduce an auxiliary
symbol that serves as a representative for the term. The auxiliary symbols enable us to keep the description of our algorithms
simple and clean since the input to them is always a sequence of symbols. We ensure that there is a one-to-one correspondence
between auxiliary symbols and regular expression terms throughout the XTRACT system; thus, if the auxiliary symbol A denotes
(bc)* in one candidate DTD, then it represents (bc)* in every other candidate DTD. Also observe that procedure
DiscoverSeqPattern may perform several iterations and thus new sequencing patterns may contain auxiliary symbols corresponding
to patterns replaced in previous iterations. For example, invoking procedure DiscoverSeqPattern with the input sequence
s = abababcababc and r = 2 yields the sequence A1cA1c after the first iteration, where A1 is an auxiliary symbol for the term
(ab)*. After the second iteration, the procedure returns the candidate DTD A2, where A2 is the auxiliary symbol corresponding
to ((ab)*c)*. Thus, the resulting candidate DTD returned by procedure DiscoverSeqPattern can contain *'s nested within
other *'s. Finally, we have chosen to invoke DiscoverSeqPattern (from GENERALIZE) with three different values for the
parameter r to control the eagerness with which we generalize. For example, for the sequence aabbb, DiscoverSeqPattern
with r = 2 would infer a*b*, while with r = 3, it would infer aab*. In the MDL step, if many other sequences are covered by
aab*, then a DTD of aab* may be preferred to a DTD of a*b* since it more accurately describes sequences in I.

The time complexity of the procedure is dominated by the first step that involves finding the subsequence x with the maximum
number of contiguous repetitions. Since s contains at most O(|s|^2) possible subsequences and computing the number of
repetitions for each subsequence takes O(|s|) steps, the complexity of the first step is O(|s|^3) per iteration, in the worst case.
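
A direct, unoptimized rendering of procedure DiscoverSeqPattern is sketched below. The names are ours, and one simplification is assumed: instead of allocating auxiliary symbols A1, A2, ..., the sketch uses the regular expression term itself (for example the string "(ab)*") as the token, which automatically gives the one-to-one correspondence between auxiliary symbols and terms. Contiguous repetitions are counted with the threshold "at least r", which is what the aabbb example requires.

```python
def count_reps(s, pos, x):
    """Number of back-to-back copies of x in s starting at position pos."""
    k = 0
    while s[pos + k * len(x): pos + (k + 1) * len(x)] == x:
        k += 1
    return k

def discover_seq_pattern(s, r):
    """Repeatedly replace >= r contiguous repetitions of x by the token (x)*."""
    if r < 2:
        raise ValueError("the threshold r must be greater than 1")
    s = list(s)
    while True:
        best = None  # (repetitions, length, subsequence)
        for length in range(1, len(s) // r + 1):
            for start in range(len(s) - length + 1):
                x = s[start:start + length]
                reps = count_reps(s, start, x)
                # Most repetitions first; ties go to the longer subsequence.
                if reps >= r and (best is None or (reps, length) > best[:2]):
                    best = (reps, length, x)
        if best is None:
            return s
        x = best[2]
        body = "".join(x)
        term = body + "*" if len(body) == 1 else "(" + body + ")*"
        out, pos = [], 0
        while pos < len(s):
            reps = count_reps(s, pos, x)
            if reps >= r:
                out.append(term)
                pos += reps * len(x)
            else:
                out.append(s[pos])
                pos += 1
        s = out

print("".join(discover_seq_pattern("abababcababc", 2)))  # ((ab)*c)*
print("".join(discover_seq_pattern("aabbb", 2)))         # a*b*
print("".join(discover_seq_pattern("aabbb", 3)))         # aab*
```

The three calls reproduce the examples discussed in this subsection.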

5.2 Discovering Or Patterns 

Procedure DiscoverOrPattern infers patterns of the form (a1 | a2 | ... | am)* based on the locality of these symbols within a
sequence s. It identifies such locality by first partitioning (performed by procedure Partition) the input sequence s into the
smallest possible subsequences s1, s2, ..., sn, such that for any occurrence of a symbol a in a subsequence si, there does not exist
another occurrence of a in some other subsequence sj within a distance d (which is a parameter to DiscoverOrPattern). Each
subsequence si in s is then replaced by the pattern (a1 | a2 | ... | am)*, where a1, ..., am are the distinct symbols in the subsequence
si. The intuition here is that if si contains frequent repetitions of the symbols a1, a2, ..., am in close proximity, then it is very
likely that si originated from a regular expression of the form (a1 | a2 | ... | am)*. As an illustration, on the input sequence abcbac,
procedure DiscoverOrPattern returns

• aA1ac for d = 2, where A1 = (b | c)*,

• aA2 for d = 3, where A2 = (a | b | c)*, and

• A2 for d = 4, where A2 = (a | b | c)*.

procedure Generalize(I)
begin
1. for each sequence s in I
2.     add s to Sg
3.     for r := 2, 3, 4
4.         s' := DiscoverSeqPattern(s, r)
5.         for d := 0.1·|s'|, 0.5·|s'|, |s'|
6.             s'' := DiscoverOrPattern(s', d)
7.             add s'' to Sg
end

procedure DiscoverSeqPattern(s, r)
begin
1. repeat
2.     let x be a subsequence of s with the maximum number (≥ r) of contiguous repetitions in s
3.     replace all (≥ r) contiguous occurrences of x in s with a new auxiliary symbol Ai = (x)*
4. until (s no longer contains ≥ r contiguous occurrences of any subsequence x)
5. return s
end

procedure DiscoverOrPattern(s, d)
begin
1. s1, s2, ..., sn := Partition(s, d)
2. for each subsequence sj in s1, s2, ..., sn
3.     let the set of distinct symbols in sj be a1, a2, ..., am
4.     if (m > 1)
5.         replace subsequence sj in sequence s by a new auxiliary symbol Ai = (a1 | ... | am)*
6. return s
end

procedure Partition(s, d)
begin
1. i := start := end := 1
2. si := s[start, end]
3. while (end < |s|)
4.     while (end < |s| and a symbol in si occurs to the right of si within a distance d)
5.         end := end + 1; si := s[start, end]
6.     if (end < |s|)
7.         i := i + 1; start := end + 1; end := end + 1; si := s[start, end]
8. return s1, s2, ..., si
end

Figure 5: The Generalization Algorithm

A critical component for discovering or patterns is procedure PARTITION, which we now discuss in more detail. Before
that, we define the following notation for sequences. For a sequence s, s[i, j] denotes the subsequence of s starting at the i-th
symbol and ending at the j-th symbol of s. Procedure Partition constructs the subsequences in the order s1, s2, and so on.
Assuming that s1 through sj have been generated, it constructs s_{j+1} by starting s_{j+1} immediately after sj ends and expanding
the subsequence s_{j+1} to the right as long as required to ensure that there is no symbol in s_{j+1} that occurs within a distance d to
the right of s_{j+1}. By construction, there cannot exist such a symbol to the left of s_{j+1}. Note that the condition whether a symbol
in si occurs within a distance d outside si can be checked in O(|s|) time if we keep track of the next occurrence outside si of
every symbol in si - this can be achieved by initially constructing, for every symbol, the locations of its occurrences in s in sorted
order. Therefore, the time complexity of procedures Partition and DiscoverOrPattern can be easily shown to be O(|s|^2).

Note that procedure GENERALIZE invokes DiscoverOrPattern on the DTDs that result from calls to DiscoverSeqPattern,
and therefore it is possible to infer more complex DTDs of the form (a|(bc)*)* in addition to DTDs like (a|b|c)*. For
instance, for the input sequence s = abcbca, procedure DiscoverSeqPattern invoked with r = 2 would return s' = aA1a,
where A1 = (bc)*, which when input to DiscoverOrPattern returns s'' = A2 for d = |s'|, where A2 = (a|A1)*. Further,
observe that DiscoverOrPattern is invoked with various values of d (expressed as a fraction of the length of the input
sequence) to control the degree of generalization. Small values of d lead to conservative generalizations while larger values result
in more liberal generalizations.
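
Procedures Partition and DiscoverOrPattern can be sketched as below. Several details are assumptions of this sketch rather than statements from the paper: the distance between two occurrences is taken to be the difference of their positions, d is an integer, an or pattern is represented by its regular expression string instead of a fresh auxiliary symbol, and the alternatives inside a pattern are listed with plain symbols first, in sorted order.

```python
def partition(s, d):
    """Split s into the smallest pieces so that no symbol of a piece reappears
    in a different piece at a distance of d positions or less."""
    parts, start, n = [], 0, len(s)
    while start < n:
        end = i = start                      # the current piece is s[start..end]
        while i <= end:
            # Look for the farthest later occurrence of s[i] within distance d
            # that lies beyond the current piece, and grow the piece up to it.
            for j in range(min(n, i + d + 1) - 1, end, -1):
                if s[j] == s[i]:
                    end = j
                    break
            i += 1
        parts.append(s[start:end + 1])
        start = end + 1
    return parts

def discover_or_pattern(s, d):
    """Replace every piece with more than one distinct symbol by (a1|...|am)*."""
    out = []
    for piece in partition(list(s), d):
        symbols = sorted(set(piece), key=lambda t: (len(t) > 1, t))
        if len(symbols) > 1:
            out.append("(" + "|".join(symbols) + ")*")
        else:
            out.extend(piece)
    return out

for d in (2, 3, 4):
    print(d, "".join(discover_or_pattern("abcbac", d)))
# 2 a(b|c)*ac
# 3 a(a|b|c)*
# 4 (a|b|c)*
print(discover_or_pattern(["a", "(bc)*", "a"], 3))  # ['(a|(bc)*)*']
```

The first three results match the abcbac illustration in this section, and the last call mirrors the abcbca example, with the token "(bc)*" standing in for the auxiliary symbol A1.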

6 The Factoring Subsystem 

In a nutshell, the factoring step derives factored forms for expressions that are an or of a subset of the candidate DTDs in Sg. For
example, for candidate DTDs ac, ad, bc and bd in Sg, the factoring step would generate the factored form (a | b)(c | d). Note
that since the final DTD is an or of candidate DTDs in Sf, factored forms are candidates, too. Further, a factored candidate DTD,
because of its smaller size, has a lower MDL cost, and is thus more likely to be chosen in the MDL step. Thus, since factored
forms (due to their compactness) are more desirable (see restriction R1 in Section 3), factoring can result in better quality DTDs.
In this section, we describe the algorithms used by the factoring module to derive factored forms of the candidate DTDs in Sg
produced by the generalization step.

Factored DTDs are common in real life when there are several choices to be made. For example, in the DTD in Figure 2, an
article may be categorized based on whether it appeared in a workshop, conference or journal; it may also be classified according
to its area as belonging to either computer science, physics, chemistry etc. Thus, the DTD (in factored form) for the element
article would then be as follows:

<!ELEMENT article (title, author*, (workshop | conference | journal),
(computer science | physics | chemistry | ...))>

The set of candidate DTDs output by the factorization module, Sf, in addition to the factored forms generated from candidates
in Sg, also contains all the DTDs in Sg. Ideally, factored forms for every subset of Sg should be added to Sf to be considered
by the MDL module. However, this is clearly impractical, since Sg could be quite large. Therefore, in the following subsection,
we propose a heuristic for selecting sets of candidates in Sg that, when factored, yield good factored DTDs. We then present a
brief description of the factoring algorithm itself, which is an adaptation of factoring algorithms for boolean expressions from the
logic optimization literature.

Note that each candidate DTD in Sg is a sequence of symbols, some of which can be auxiliary symbols. Recall that auxiliary
symbols translate to regular expressions on symbols in Σ, and there is a one-to-one correspondence between auxiliary symbols
and the expressions that they represent.

procedure FactorSubsets(Sg)
begin
1. for each DTD D in Sg
2.     compute score(D, Sg)
3. Sf := S' := Sg; SeedSet := ∅
4. for i := 1 to k
5.     let D be the DTD in S' with the maximum value for score(D, Sg)
6.     SeedSet := SeedSet ∪ {D}
7.     S' := S' − {D' : overlap(D, D') ≥ δ}
8. for each DTD D in SeedSet
9.     S := {D}
10.    S' := Sg − {D' : overlap(D, D') ≥ δ}
11.    while (S' is not empty)
12.        let D' be the DTD in S' with the maximum value for score(D', S)
13.        S := S ∪ {D'}
14.        S' := S' − {D'' : overlap(D', D'') ≥ δ}
15.    F := Factor(S)
16.    Sf := Sf ∪ {F1, ..., Fm}    /* F = F1 | ... | Fm */
end

Figure 6: Choosing Subsets Of Sg For Factoring

6.1 Selecting Subsets of Sg to Factor 

In this section, we describe how we choose subsets of Sg that lead to good factorizations. Intuitively, a subset S of Sg is a
good candidate for factoring if the factored form of S is much smaller than S itself. In addition, even though Sg may contain
multiple generalizations that are derived from the same input sequence, it is highly unlikely that the final DTD will contain two
generalizations of the same input sequence. Thus, factoring candidate DTDs in Sg that cover similar sets of input sequences does
not lead to factors that can improve the quality of the final DTD.

We thus conclude that for a subset S of Sg to yield good factored forms, it must satisfy the following two properties:

1. Every DTD in S has a common prefix or suffix with a number of other DTDs in S. Further, as more DTDs in S share
common prefixes or suffixes, or as the length of the common prefixes/suffixes increases, the quality of the generated
factored form can be expected to improve.

2. The overlap between every pair of DTDs D, D' in S is minimal, that is, the intersection of the input sequences covered
by D and D' is small. This is important because, as mentioned above, a factored DTD adds little value (from an MDL
cost perspective) over the candidate DTDs from which it was derived if it cannot be used to encode a significantly larger
number of input sequences compared to the sequences covered by each individual DTD.

Definitions. In order to state properties (1) and (2) for a set S of DTDs more formally, we need to first define the following
notation. For a DTD D, let cover(D) denote the input sequences in I that are covered by D (note that auxiliary symbols are
expanded completely when cover for a DTD is computed). Then, overlap(D, D') is defined as the fraction of the input sequences
covered by D and D' that are common to D and D', that is,

\[ overlap(D, D') = \frac{|cover(D) \cap cover(D')|}{|cover(D) \cup cover(D')|} \]

For a sufficiently small value of the (user-specified) parameter δ, by ensuring that overlap(D, D') < δ for every pair of DTDs
D and D' in S, we can ensure that S satisfies Property (2) mentioned above.

In order to characterize Property (1) more rigorously, we introduce the function score(D, S) which attempts to capture the
degree of similarity between prefixes/suffixes of DTD D and those of DTDs in the set S of DTDs. Intuitively, a DTD with a high
score with respect to set S is a good candidate to be factored with other DTDs in set S. For a DTD D, let pref(D) and suf(D)
denote the set of prefixes and suffixes of D, respectively. Let psup(p, S) denote the support of prefix p in set S of DTDs, that is,
the number of DTDs in S for which p is a prefix. Similarly, let ssup(s, S) denote the number of DTDs in S for which s is a suffix.
Then score(D, S) is defined as follows.

\[ score(D, S) = \max\big( \{ |p| \cdot psup(p, S) : p \in pref(D) \} \cup \{ |s| \cdot ssup(s, S) : s \in suf(D) \} \big) \]

Thus, the prefix/suffix p/s of D for which the product of its length and its support in S is maximum determines the score
of D with respect to S. The intuition here is that if DTD D has a long prefix or suffix that occurs frequently in set S, then this
prefix or suffix can be factored out, thus resulting in good factored forms. The function score is thus a good measure of how well D would
factor with other DTDs in S.
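
The two measures can be written down directly. In this sketch (function names and data layout are our assumptions) a DTD is a string or tuple of symbols, cover sets are Python sets of input sequence identifiers, and supports are counted by a simple scan rather than with the trie mentioned later in the complexity discussion.

```python
def overlap(cover_d, cover_e):
    """|cover(D) ∩ cover(D')| / |cover(D) ∪ cover(D')| for two cover sets."""
    union = cover_d | cover_e
    return len(cover_d & cover_e) / len(union) if union else 0.0

def score(d, dtds):
    """max over prefixes p and suffixes s of d of |p|*psup(p) and |s|*ssup(s)."""
    best = 0
    for n in range(1, len(d) + 1):
        prefix, suffix = d[:n], d[-n:]
        psup = sum(1 for e in dtds if e[:n] == prefix)
        ssup = sum(1 for e in dtds if len(e) >= n and e[-n:] == suffix)
        best = max(best, n * psup, n * ssup)
    return best

print(score("ac", ["ac", "ad", "bc", "bd"]))   # 2
print(score("abc", ["abc", "abd", "abe"]))     # 6
print(overlap({1, 2}, {2, 3}))                 # 0.3333333333333333
```

In the second call the prefix ab has length 2 and support 3, which gives the score 6.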

Algorithm. Procedure FactorSubsets, shown in Figure 6, first selects subsets S of Sg to factor that satisfy properties (1) and
(2) mentioned earlier. Each of these subsets S is then factored by invoking procedure Factor (in Step 15) described in the next
subsection. Assuming that the factoring algorithm returns F1 | F2 | ... | Fm, each of the Fi is added to Sf that is then input to the
MDL module.

We now discuss how procedure FactorSubsets computes the set S of candidate DTDs to factor. First, k seed DTDs for
the sets S to be factored are chosen in the for loop spanning steps 4-7. These seed DTDs have a high score value with respect to
Sg and overlap minimally with each other. Thus, we ensure that each seed DTD not only factors well with other DTDs in Sg
but is also significantly different from other seeds. In steps 9-14, each seed DTD is used to construct a new set S of DTDs to
be factored (thus, only k sets of DTDs are generated). After initializing S to a seed DTD D, in each subsequent iteration, the
next DTD D' that is added to S is chosen greedily - it is the one whose score with respect to DTDs in S is maximum and whose
overlap with DTDs already in S is less than δ.
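
The selection logic of steps 1-14 can be sketched as follows. The sketch stops before Step 15 (procedure Factor is not shown), treats DTDs as strings of single-character symbols, and removes a candidate from the pool when its overlap with the chosen DTD is at least δ (called `delta` below). The function names, the `cover` dictionary and the toy data are all assumptions made for illustration.

```python
def overlap(cover_d, cover_e):
    union = cover_d | cover_e
    return len(cover_d & cover_e) / len(union) if union else 0.0

def score(d, dtds):
    best = 0
    for n in range(1, len(d) + 1):
        psup = sum(1 for e in dtds if e[:n] == d[:n])
        ssup = sum(1 for e in dtds if len(e) >= n and e[-n:] == d[-n:])
        best = max(best, n * psup, n * ssup)
    return best

def choose_subsets(cands, cover, k, delta):
    """Return up to k lists of candidate DTDs that look worth factoring together."""
    def far(d, pool):
        # Candidates whose overlap with d is below the threshold delta.
        return [e for e in pool if overlap(cover[d], cover[e]) < delta]

    seeds, pool = [], list(cands)
    for _ in range(k):                                   # steps 4-7: pick seeds
        if not pool:
            break
        d = max(pool, key=lambda e: score(e, cands))
        seeds.append(d)
        pool = far(d, pool)

    subsets = []
    for d in seeds:                                      # steps 8-14: grow each set
        chosen, pool = [d], far(d, cands)
        while pool:
            e = max(pool, key=lambda c: score(c, chosen))
            chosen.append(e)
            pool = far(e, pool)
        subsets.append(chosen)
    return subsets

# Toy data: "ae" covers the same input sequence as "ac", so it is never grouped with it.
cands = ["ac", "ad", "bc", "bd", "ae"]
cover = {"ac": {0}, "ad": {1}, "bc": {2}, "bd": {3}, "ae": {0}}
print(choose_subsets(cands, cover, k=1, delta=0.5))  # [['ac', 'ad', 'bc', 'bd']]
```

The resulting set corresponds to the expression ac | ad | bc | bd, which the factoring step would turn into (a | b)(c | d).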

Complexity Results. The time complexity of selecting the sets S to factor in the FactorSubsets procedure can be shown to
be O(N^2 · L), where N = |I| and L is the maximum length of an input sequence in I. The reason for this is that the
initial computation of score(D, Sg) for every DTD D in Sg requires us to compute the support of every prefix and suffix of D
in Sg. Since Sg contains O(N) DTDs, and each DTD can have at most 2L prefixes/suffixes, there are at most O(N · L) distinct
prefixes and suffixes. The supports for these can be computed in O(N · L) steps by storing them in a trie structure. Thus, the
time complexity of computing the scores for all the DTDs in Sg (in steps 1-2) is O(N · L).

Computing the overlap between a pair of DTDs requires O(N) time to compute the intersection and union of the input
sequences they cover. Thus, the worst-case time complexity to compute the overlap between all pairs of DTDs in Sg is O(N^3).
Assuming that we precompute the overlapping DTD pairs in Sg, SeedSet can be computed in O(N) steps (since the number of
seeds, k, is a constant). Furthermore, the time complexity of computing each set S of DTDs to be factored can be shown to be
O(N^2 · L) since the while loop (steps 11-14) performs at most O(N) iterations and the cost of recomputing the scores for DTDs
in S' (with respect to S) in each iteration is O(N · L) (as before, this can be achieved by maintaining a trie structure for prefixes
and suffixes of DTDs in S).

6.2 Algorithm For Factoring a Set of DTDs 

In this section, we show how the factored form for a set S of DTDs can be derived - the expression we factor is actually the or
of the DTDs in S. Algorithms for computing the optimum factored form, that is, the one with the minimum number of literals,
have been proposed earlier in [Law64]. However, the complexity of these exact techniques is impractical for all but the smallest

7.2 Data Sets



In order to evaluate the quality of DTDs retrieved by XTRACT, we used both synthetic and real-life DTD schemas. For each
DTD for a single element, we generated an XML file containing 1000 instantiations of the element. These 1000 instantiations
were generated by randomly sampling from the DTD for the element. Thus, the initial set of input sequences I to both XTRACT
and DDbE contained somewhere between 500 and 1000 sequences (after the elimination of duplicates) conforming to the original
DTD.

Synthetic DTD Data Set. We used a synthetic data generator to generate the synthetic data sets. Each DTD is randomly chosen
to have one of the following two forms: A1|A2|A3|...|An and A1A2A3...An. Thus, a DTD has n building blocks, where n is a
randomly chosen number between 1 and mb, and mb is an input parameter to the generator that specifies the maximum number
of building blocks in a DTD. Each building block Ai further consists of ni symbols, where ni is randomly chosen to be between
1 and ms (the parameter ms specifies the maximum number of symbols that can be contained in a building block). Each building
block Ai has one of the following four forms, each of which has an equal probability of occurrence: (1) (a1|a2|a3|...|a_ni),
(2) a1a2a3...a_ni, (3) (a1|a2|a3|...|a_ni)*, (4) (a1a2a3...a_ni)*. Here, the aj's denote subelement symbols. Thus, our
synthetic data generator essentially generates DTDs containing one level of nesting of regular expression terms.
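
A generator of this kind might be sketched in Python as follows. This is only an illustration of the description above, not the generator used in our experiments: the names random_dtd and random_building_block are hypothetical, and the choice to draw consecutive lowercase letters as subelement symbols is an assumption (which also limits mb · ms to at most 26).

```python
import random
import string


def random_building_block(symbols, rng):
    """Pick one of the four building-block forms with equal probability."""
    form = rng.randrange(4)
    if form == 0:
        return "(" + "|".join(symbols) + ")"       # (a1|a2|...|a_ni)
    if form == 1:
        return "".join(symbols)                    # a1a2...a_ni
    if form == 2:
        return "(" + "|".join(symbols) + ")*"      # (a1|a2|...|a_ni)*
    return "(" + "".join(symbols) + ")*"           # (a1a2...a_ni)*


def random_dtd(mb, ms, rng):
    """Build a DTD regular expression with at most mb blocks of at most ms symbols."""
    n = rng.randint(1, mb)                          # number of building blocks
    alphabet = iter(string.ascii_lowercase)         # assumption: fresh letter per symbol
    blocks = []
    for _ in range(n):
        ni = rng.randint(1, ms)                     # symbols in this block
        symbols = [next(alphabet) for _ in range(ni)]
        blocks.append(random_building_block(symbols, rng))
    separator = rng.choice(["|", ""])               # A1|A2|...|An or A1A2...An
    return separator.join(blocks)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_dtd(5, 5, rng))
```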

In Table 3, we show the synthetic DTDs that we considered in our experiments (note that, in the figure, we only include the 
regular expression corresponding to the DTD). The DTDs were produced using our generator with the input parameters mb and 
ms both set to 5. Note that we use letters from the alphabet as subelement symbols. 



No.   Original DTD

1     abcde|efgh|ij|klm
2     (a|b|c|d|f)*gh
3     (a|b|c|d)*|e
4     (abcde)*f
5     (ab)*|cdef|(ghi)*
6     abcdef(g|h|i|j)(k|l|m|n|o)
7     (a|b|c)d*e*(fgh)*
8     (a|b)(cdefg)*hijklmnopq(r|s)*
9     (abcd)*|(e|f|g)*|h|(ijklm)*
10    a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*

Table 3: Synthetic DTD Data Set



The ten synthetic DTDs vary in complexity with later DTDs being more complex than the earlier ones. For instance, DTD 1 
does not contain any metacharacters, while DTDs 2 through 5 contain simple sequencing and or patterns. DTD 6 represents a 
DTD in factored form while in DTDs 7 through 10, factors are combined with sequencing and or patterns. 

Real-life DTD Data Set. We obtained our real-life DTDs from the Newspaper Association of America (NAA) Classified
Advertising Standards XML DTD produced by the NAA Classified Advertising Standards Task Force^1. We examined this
real-life DTD data and collected six representative DTDs that are shown in Table 4. Of the DTDs shown in the table, the last three
DTDs are quite interesting. DTD 4 contains the metacharacter ? in conjunction with the metacharacter *, while DTDs 5 and 6
contain two regular expressions with *'s, one nested within the other.

7.3 Quality of Inferred DTDs

Synthetic DTD Data Set. The DTDs inferred by XTRACT and DDbE for the synthetic data set are presented in Table 5. As
[...] accurate DTD for only
DTD 1, which is the simplest DTD containing no metacharacters. Even for the simple DTDs 2-5, not only is DDbE unable to
correctly deduce the original DTD, but it also infers a DTD that does not cover the set of input sequences. For example,
[...]. Thus, the DDbE tool has
a tendency to over-generalize when the original DTDs contain regular expressions with *'s. This same trend to over-generalize
can be seen in DTDs 8-10 also. On the other hand, as is evident from Table 5, this is not the case for XTRACT, which correctly
[...] combinations of sequencing
and or patterns. This clearly demonstrates the effectiveness of our generalization module in discovering these patterns and our
MDL module in selecting these general candidate DTDs as the final DTDs.

[...] DDbE is unable to derive
the final factored form for DTD 6. Finally, DDbE infers an extremely complex DTD for the simple DTD 7. The results for
[...] approach (based on the combination of generalization,
factoring and the MDL principle) compared to DDbE's for the problem of inferring DTDs.

[...] DTDs, XTRACT is able to infer the first five correctly. In contrast, DDbE is able to derive the accurate DTD only for DTDs 1
and 2, and an approximate DTD for DTD 3. Basically, with an additional factoring step, DDbE could obtain the original DTD for
DTD 3. Note, however, that DDbE is unable to infer the simple DTD 4 that contains the metacharacter ?. In contrast, XTRACT
[...] the form 1|a to a?. DTD 5 represents an interesting case where XTRACT is able to mine a DTD containing regular expressions
containing nested *'s. This is due to our generalization module that iteratively looks for sequencing patterns. On the other hand,
DDbE simply over-generalizes DTD 5 by oring all the symbols in it and enclosing them within the metacharacter +. Finally,
^1 This can be accessed at http://www.naa.org/technology/clsstdtf/Adex010.dtd

No.   Original DTD                          DTD Inferred by XTRACT                DTD Inferred by DDbE

1     abcde|efgh|ij|klm                     abcde|efgh|ij|klm                     abcde|efgh|ij|klm
2     (a|b|c|d|f)*gh                        (a|b|c|d|f)*gh                        gh(a|b|c|d|f)+gh
3     (a|b|c|d)*|e                          (a|b|c|d)*|e                          [illegible]
4     (abcde)*f                             (abcde)*f                             (f(a|e|d|c|b)+f)
5     (ab)*|cdef|(ghi)*                     (ab)*|cdef|(ghi)*                     cdef(a|b|g|i|h)+cdef
6     abcdef(g|h|i|j)(k|l|m|n|o)            abcdef(g|h|i|j)(k|l|m|n|o)            [illegible]
7     (a|b|c)d*e*(fgh)*                     (a|b|c)d*e*(fgh)*                     [illegible]
8     (a|b)(cdefg)*hijklmnopq(r|s)*         (a|b)(cdefg)*hijklmnopq(r|s)*         [illegible]
9     (abcd)*|(e|f|g)*|h|(ijklm)*           (abcd)*|(ijklm)*|h|(e|f|g)*           [illegible]
10    a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*    a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*    (a+|gh)(e|f|d|i|j|l|n|m|k|c|b)+(a+|gh)

Table 5: DTDs generated by XTRACT and DDbE for Synthetic Data Set



No.   Simplified DTD    DTD Obtained by XTRACT    DTD Obtained by DDbE

1     a|b|c|d|e         a|b|c|d|e                 a|b|c|d|e
2     (a|b|c|d|e)*      (a|b|c|d|e)*              (a|b|c|d|e)*
3     (ab*c*)           ab*c*                     (ab+c*)|(ac*)
4     a*b?c?d?          a*b?c?d?                  (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|a+|b)
5     (a(bc)+d)*        (a(bc)*d)*                (a|b|c|d)+
6     (ab?c*d?)*        -                         (a|b|c|d)+

Table 6: DTDs generated by XTRACT and DDbE for Real-life Data Set



neither XTRACT nor DDbE is able to correctly infer DTD 6. (The approximate DTD derived by XTRACT for DTD 6 is rather
complex and, therefore, we chose to omit it from Table 6.) The reason for XTRACT's failure is that our generalization subsystem 
does not detect patterns containing the optional symbol ?. Finding such patterns requires a more sophisticated analysis of symbol 
occurrences within and across sequences, and we plan to pursue this further as part of future work. 

8 Conclusions 

In this paper, we presented the architecture of the XTRACT system for inferring a DTD for a database of XML documents. 
The DTD plays the role of a schema and thus contains valuable information about the structure of the XML documents that it 
describes. However, since DTDs are not mandatory, in a number of cases, documents in an XML database may not have an 
accompanying DTD. Thus, the DTD inference problem is important, especially given the critical role that the DTD plays in the 
storage as well as the formulation, optimization and processing of queries on the underlying data.

The problem of deriving the DTD for a set of documents is complicated by the fact that the DTD syntax incorporates the full
expressive power of regular expressions. Specifically, as we showed, naive approaches that do not "generalize" beyond the input
element sequences fail to deduce concise and semantically meaningful DTDs. Instead, XTRACT applies sophisticated algorithms
in three steps to compute a DTD that is more along the lines that a human would infer. In the first generalization step, patterns
[...] more succinct. The two steps [...]. In the
third and final step, XTRACT employs the MDL principle to select from [...]
balance between conciseness and preciseness - that is, a DTD that [...] general. The
MDL principle maps naturally to the facility location problem, [...] an approximation algorithm
recently proposed in the literature.

We compared the quality of the DTDs inferred by XTRACT with [...] (Data Descriptors by
Example) DTD extraction tool [...]. In our experiments, XTRACT outperformed
DDbE by a wide margin, and for most DTDs [...]. A
number of the DTDs which were [...] XTRACT [...]
nested [...]. Thus, our [...]
generalization and factorization [...]
algorithms to infer even more complex DTDs [...].

- Appendix -



procedure Factor(S) /* S is the set of sequences to be factored */
begin
1. DivisorSet := FindAllDivisors(S)
2. if (DivisorSet = ∅)
3.     return or of sequences in S
4. DivisorList := ∅
5. for each divisor V in DivisorSet
6.     Q, R := Divide(S, V)
7.     add (V, Q, R) to DivisorList
8. find the most compact triplet (Vi, Qi, Ri) in DivisorList
9. return (Factor(Vi))(Factor(Qi)) | Factor(Ri)
end

procedure FindAllDivisors(S)
begin
1. DivisorSet := ∅
2. for each distinct sequence s such that s is a suffix for at least two elements in S
3.     DivisorSet := DivisorSet ∪ { {p : ps ∈ S} }
4. return DivisorSet
end

procedure Divide(S, V)
begin
1. for each sequence p in V
2.     qp := {s : ps ∈ S}
3. Q := ∩ (p ∈ V) qp
4. R := S - V ∘ Q
   /* V ∘ Q is the set of sequences resulting from concatenating
      every sequence in Q to the end of every sequence in V */
5. return Q, R
end

Figure 7: Factoring Algorithm
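
The procedures of Figure 7 can be sketched in Python roughly as follows. This is an illustrative reading rather than the implementation used in XTRACT, and it rests on several assumptions that are not stated in the figure: symbols are single characters with no special meaning in regular expressions, only non-empty suffixes are considered when collecting divisors, "most compact" is taken to mean the smallest total number of symbols in V, Q and R, and an empty sequence in a set is rendered with the ? metacharacter. The helper names compactness, _factor and _group are hypothetical.

```python
from itertools import product


def find_all_divisors(S):
    """Each non-empty suffix s shared by >= 2 sequences yields the divisor {p : ps in S}."""
    suffixes = {seq[i:] for seq in S for i in range(len(seq))}
    divisors = set()
    for s in suffixes:
        V = frozenset(seq[:len(seq) - len(s)] for seq in S if seq.endswith(s))
        if len(V) >= 2:
            divisors.add(V)
    return divisors


def divide(S, V):
    """Return quotient Q (suffixes common to every p in V) and remainder R."""
    quotients = [{seq[len(p):] for seq in S if seq.startswith(p)} for p in V]
    Q = frozenset(set.intersection(*quotients))
    R = frozenset(S) - {p + q for p, q in product(V, Q)}
    return Q, R


def compactness(triplet):
    """Assumed measure: total number of symbols in V, Q and R (smaller is better)."""
    return sum(len(seq) for part in triplet for seq in part)


def _group(result):
    """Parenthesize an expression only if its top level is an alternation."""
    expr, is_alternation = result
    return "(" + expr + ")" if is_alternation else expr


def _factor(S):
    if "" in S:                                   # assumed handling of the empty sequence
        rest = S - {""}
        if not rest:
            return "", False
        expr, _ = _factor(rest)
        return (expr + "?" if len(expr) == 1 else "(" + expr + ")?"), False
    divisors = find_all_divisors(S)
    if not divisors:                              # steps 2-3: plain or of the sequences
        return "|".join(sorted(S)), len(S) > 1
    triplets = [(V,) + divide(S, V) for V in divisors]
    V, Q, R = min(triplets, key=compactness)      # step 8
    expr = _group(_factor(V)) + _group(_factor(Q))
    if R:                                         # step 9: ... | Factor(R)
        return expr + "|" + _factor(R)[0], True
    return expr, False


def factor(S):
    """Return a factored regular expression for the set of sequences S."""
    return _factor(frozenset(S))[0]


if __name__ == "__main__":
    print(factor({"ac", "ad", "bc", "bd"}))
```

For the set {ac, ad, bc, bd}, the only divisor is {a, b}, the quotient is {c, d} and the remainder is empty, so the sketch produces a single product of two or-terms.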
TITLE: DOCUMENT DESCRIPTOR EXTRACTION METHOD

FIELD OF THE INVENTION 

The present invention relates to electronic documents. Specifically, the present 
invention relates to determining document descriptors from data within electronic documents.

BACKGROUND OF THE INVENTION 

The number of documents available in electronic format has exploded. With the 
number of available electronic documents increasing rapidly, it is important to be able to 
quickly and accurately search the available electronic documents. In addition, it is desirable 
to be able to store data into electronic documents and generate new electronic documents 
which are similar in structure to existing electronic documents. Hence, tools which assist in
the querying of electronic documents, the creation of electronic documents, and the storage of 
data into electronic documents are desirable. 

Electronic documents for display over the Internet and/or an Intranet are commonly 
stored in a Standard Generalized Markup Language (SGML) format. SGML is a standard for 
how to specify a document markup language or tag set. SGML is not in itself a document
language, but a description of how to specify one. The SGML format provides for the 
inclusion of a document type descriptor (DTD). A document's DTD specifies how the data 
within a document should be organized. One SGML format for storing data within electronic 
documents which is becoming increasingly popular is Extensible Markup Language (XML).
XML is rapidly emerging as the new standard for representing and exchanging data on the 
World Wide Web (web). An XML document may be accompanied by a document type 
descriptor (DTD). For example, in an XML document, the DTD may specify the tags which 
can be used, the order in which the tags appear, how the tags are nested, and tag attributes. 



Thus, the DTD plays an important role in the storage of data to the XML document, the 
generation of similar documents, and increasing the efficiency of queries of the XML 
document. Efficiency is achieved by using the knowledge of the structure of the data to 
remove elements that cannot potentially satisfy the query. 

Although DTDs are helpful in the storage, generation, and retrieval of data related to 
an XML document, DTDs are not mandatory. Since DTDs are not mandatory, many XML 
documents exist which do not contain DTDs. In addition, since only a small portion of the
electronic documents in existence today are in an XML format, initially the majority of XML 
documents will likely be automatically generated from pre-existing non-XML documents. In 
many instances, the automatically generated XML formatted documents will not contain 
DTDs. Therefore, a tool for automatically generating DTDs is desirable for improving data 
storage and retrieval. 

Others have attempted to automatically generate DTDs with varying degrees of
success. One system is IBM's Data Descriptors by Example (DDbE) system. The goal of
DDbE is to give users a good start at creating DTDs for their own applications. However,
this system and other available systems do not produce highly accurate DTDs for all XML 
documents, especially complex XML documents. Since accurate DTDs enable efficient 
storage and retrieval of data, improved methods for extracting accurate DTDs from XML 
documents are desirable. 

SUMMARY OF THE INVENTION 

The present invention relates to developing a description of the layout of an electronic
document from data within the document. The present invention is especially useful for
determining document type descriptors (DTDs) of electronic documents in a Standard
Generalized Markup Language (SGML) format.

The present invention comprises generalizing input sequences generated from an
electronic document. The input sequences are generalized to create generalized sequences
which are representative of the input sequences. Each generalized sequence encompasses one
or more input sequences in a more general form. Next, the present invention comprises
selecting a description of the layout of the electronic document from the input sequences and
generalized sequences. Selecting a description comprises selecting one or more of the input
sequences and generalized sequences such that every input sequence is encompassed by the
selected sequences. Preferably, the selection is performed using minimum description length
(MDL) principles.

Additionally, the present invention may comprise factoring the input sequences and 
generalized sequences after generalizing to create factored sequences which can be included 
in the selection of the description. Each factored sequence encompasses one or more input 
sequences and generalized sequences. The factored sequences are combined with the input
sequences and generalized sequences, thereby creating a potentially better selection of 
sequences from which a description may be selected. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a flow chart of a preferred document type descriptor (DTD) extraction 
system in accordance with the present invention; and 

Figure 2 is an illustrative depiction of the output of each step and the selection process 
of the preferred document type descriptor extraction system depicted in Figure 1 in 
accordance with the present invention. 
DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to inferring (i.e., determining) document descriptors
from data within electronic documents. For illustrative purposes, the present invention is
described in terms of inferring document type descriptors (DTDs) from data within
Extensible Markup Language (XML) formatted documents. However, it will be readily
apparent to those skilled in the art that the present invention could be applied to other types of
markup languages which provide document descriptions that are currently available or
developed in the future, such as markup languages which conform to the Standard Generalized
Markup Language (SGML) format. The inferred DTD contains valuable information about
the structure of the XML documents that it describes. The structural information may be used
to efficiently query the XML document, store data to the XML document, or generate similar
XML documents.

A sample XML document and its associated DTD are as follows: 

Sample XML Document 

<article>
    <title> A Relational Model for Large Shared Data Banks </title>
    <author>
        <name> E. F. Codd </name>
        <affiliation> IBM Research </affiliation>
    </author>
</article>
<article>
    <title> XTRACT: A system for Extracting DTDs </title>
    <author>
        <name> M. Garofalakis </name>
        <affiliation> Bell Labs </affiliation>
    </author>
    <author>
        <name> A. Gionis </name>
        <affiliation> Stanford University </affiliation>
    </author>
    <author>
        <name> R. Rastogi </name>
        <affiliation> Bell Labs </affiliation>
    </author>
    <author>
        <name> S. Seshadri </name>
        <affiliation> Bell Labs </affiliation>
    </author>
    <author>
        <name> K. Shim </name>
        <affiliation> Bell Labs </affiliation>
    </author>
</article>

Sample Document Type Descriptor (DTD) 

<!ELEMENT article (title, author*)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT author (name, affiliation)> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT affiliation (#PCDATA)> 

A DTD describes the structure of an XML document. A DTD constrains the structure
of an element by specifying a regular expression to which its sub-element sequences must
conform. The DTD declaration sequence uses commas for sequencing, | for exclusive OR,
parentheses for grouping and meta-characters ?, *, + to denote zero or one, zero or more, and
one or more, respectively.
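
For illustration only, the correspondence between a DTD content model and a regular expression might be sketched in Python as shown below. The function names content_model_to_regex and conforms are hypothetical, and the sketch makes simplifying assumptions: it handles only element names, commas, |, parentheses and the meta-characters ?, *, +, and it does not treat #PCDATA or other DTD keywords specially.

```python
import re

OPERATORS = set("()|?*+")


def content_model_to_regex(model):
    """Translate a content model such as "(title, author*)" into a Python regex.

    Each element name becomes a group matching the name followed by ";",
    so a list of children can be matched as one ";"-terminated string.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[()|?*+,]", model)
    parts = []
    for token in tokens:
        if token == ",":
            continue                       # sequencing is plain concatenation
        if token in OPERATORS:
            parts.append(token)            # same meaning in DTDs and in regexes
        else:
            parts.append("(?:" + re.escape(token) + ";)")
    return "".join(parts)


def conforms(model, children):
    """Check whether a sequence of child element names satisfies the content model."""
    text = "".join(name + ";" for name in children)
    return re.fullmatch(content_model_to_regex(model), text) is not None


if __name__ == "__main__":
    print(conforms("(title, author*)", ["title", "author", "author"]))
    print(conforms("(title, author*)", ["author", "title"]))
```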

In the sample XML document above and its associated DTD, the start of an element 
such as article is indicated by <article> and the end of the element is indicated by </article>. 
Each element may comprise sub-elements and/or data. For example, for the element article, 
title and author are sub-elements. Likewise, sub-elements may further contain additional sub- 
elements. For example, author contains sub-element name and sub-element affiliation. 

In a preferred embodiment, the present invention applies algorithms in three steps to
compute a DTD from a set of input sequences. They are (1) generalizing, (2) factoring, and
(3) selecting.

The input sequences are groupings of sub-elements contained within each occurrence
of an element. For an element, such as article, it is straightforward to compute the sequence
of sub-elements nested within each <article> </article> pair in the XML document. The set
of input sequences comprises one sequence for each occurrence of element <article>. For
example, in the above XML document sample the input sequences for <article> would be
input sequences <title><author> and <title><author><author><author><author><author>.
For ease of description, the first letter of the sub-element may be used as a shorthand for
describing sequences (e.g., <title><author> is represented by ta and
<title><author><author><author><author><author> is represented by taaaaa.)
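
As a non-limiting illustration, the input sequences for an element might be computed with Python's standard xml.etree.ElementTree module as sketched below. The function names input_sequences and shorthand are hypothetical, the element text is abbreviated, and the enclosing <articles> root element is an assumption added only so that the sample parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# Abbreviated form of the sample document, wrapped in a hypothetical root element.
SAMPLE = """
<articles>
  <article>
    <title>A Relational Model for Large Shared Data Banks</title>
    <author><name>E. F. Codd</name><affiliation>IBM Research</affiliation></author>
  </article>
  <article>
    <title>XTRACT: A system for Extracting DTDs</title>
    <author><name>M. Garofalakis</name><affiliation>Bell Labs</affiliation></author>
    <author><name>A. Gionis</name><affiliation>Stanford University</affiliation></author>
    <author><name>R. Rastogi</name><affiliation>Bell Labs</affiliation></author>
    <author><name>S. Seshadri</name><affiliation>Bell Labs</affiliation></author>
    <author><name>K. Shim</name><affiliation>Bell Labs</affiliation></author>
  </article>
</articles>
"""


def input_sequences(root, element):
    """One sequence of direct sub-element names per occurrence of the element."""
    return [[child.tag for child in node] for node in root.iter(element)]


def shorthand(sequence):
    """Abbreviate each sub-element by its first letter, e.g. title, author -> ta."""
    return "".join(tag[0] for tag in sequence)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for sequence in input_sequences(root, "article"):
        print(shorthand(sequence))
```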

In the preferred embodiment, the input sequences are generalized to create generalized 
sequences. The generalized sequences and input sequences are then factored to create 
factored sequences. Each factored sequence may encompass one or more input sequences and 
generalized sequences, thereby creating additional sequences which may be selected as a part 
of a DTD. The factoring step is optional. However, using the factoring step results in 
potentially better DTDs. Factoring leads to better DTDs by creating additional sequences 
from which an appropriate DTD may be selected. A DTD which encompasses all of the input 
sequences is then selected from the input sequences, generalized sequences, and factored 
sequences. 

In the generalization step, patterns within the input sequences are detected and more
"general" regular expressions are substituted for them to create "generalized" sequences. In a
preferred embodiment, the "generalized" sequences and the input sequences are then
processed by the factorization step which factors common expressions to make them more
succinct. The factorization step yields "factored" sequences. The first two steps along with
the input sequences produce a series of potential DTDs that vary in their conciseness and
precision. A selection step then selects a DTD from the candidates that strikes the right
balance between conciseness and preciseness - that is, a DTD that is concise, but at the same
time, is not too general. In a preferred embodiment, the selection step employs minimum
description length (MDL) principles for selecting a DTD.

Figure 1 depicts a flow chart 100 illustrating the steps for inferring a DTD in
accordance with a preferred embodiment of the present invention. The input sequences I are
comprised of sub-elements a, b, c, d, and e. The input sequences are first processed by a
generalization module 110 which produces generalized sequences. The generalized
sequences are combined with the input sequences to create a set of potential DTDs identified
by S_G. Optionally, the potential DTDs are factored using a factoring module 120. The
factoring module produces additional potential DTDs which are combined with the potential
DTDs output by the generalization module 110 to create a set of potential DTDs identified by
S_F. Finally, the selecting module 130 infers (i.e., selects) a DTD from all of the potential
DTDs S_F. Preferably, the selecting module 130 incorporates MDL principles.

Figure 2 graphically depicts the selection of a DTD from all of the potential DTDs.
The selected DTD must encompass all of the original input sequences I. It can be seen that
(ab)* encompasses input sequences ab and abab. Also, (a|b)(c|d) encompasses input
sequences ac, ad, bc, and bd. Finally, b*(d|e) encompasses bd, bbd, and bbbbe. The selected
potential DTDs, when combined using ORs, encompass all of the original input sequences.
The result is a concise and precise DTD.
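
The coverage requirement described above can be checked mechanically. The following Python sketch is offered only as an illustration: it uses the input sequences and candidate DTDs named in the preceding paragraph, treats each candidate as an ordinary regular expression over single-letter symbols, and uses hypothetical helper names (covered_by, covers_all). It does not model the MDL-based selection itself.

```python
import re

# Input sequences and selected candidates as listed in the description of Figure 2.
INPUTS = ["ab", "abab", "ac", "ad", "bc", "bd", "bbd", "bbbbe"]
SELECTED = ["(ab)*", "(a|b)(c|d)", "b*(d|e)"]


def covered_by(candidates, sequence):
    """Return the candidates whose language contains the given sequence."""
    return [c for c in candidates if re.fullmatch(c, sequence)]


def covers_all(candidates, sequences):
    """True if the OR of the candidates encompasses every input sequence."""
    dtd = "|".join(candidates)             # final DTD: OR of the selected candidates
    return all(re.fullmatch(dtd, s) for s in sequences)


if __name__ == "__main__":
    print("|".join(SELECTED), covers_all(SELECTED, INPUTS))
    for sequence in INPUTS:
        print(sequence, covered_by(SELECTED, sequence))
```

Note that a sequence such as bd is encompassed by more than one candidate, which is permitted as long as every input sequence is encompassed by at least one.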

I. GENERALIZING

The quality of the document type descriptor (DTD) selected during the selection process is
very dependent on the set of candidate DTDs available. If the selection were based on the
input sequences only, then the final DTD output by the selection step would simply be the OR
of all the input sequences. For example, in the above XML document sample, the DTD for
<article> would comprise ta and taaaaa (i.e., <title><author> and
<title><author><author><author><author><author>.) However, this is not a desirable DTD
since it is neither concise nor intuitive. A more concise and intuitive DTD would be the
single sequence ta* which encompasses both ta and taaaaa. Thus, in order to infer
meaningful DTDs, the candidate DTDs should be general. Ideally, each candidate DTD
encompasses more than one input sequence. The goal of the generalization module 110 is to
achieve this objective.

The generalization module 110 of the present invention infers a number of regular
expressions which have been found to frequently appear in real-life DTDs. Below are
examples of regular expressions from real-life DTDs that appear in the Newspaper
Association of America (NAA) Classified Advertising Standards XML DTD (found at
http://www.naa.org/technology/clsstdtf/Adex010.dtd).

a*bc* : DTDs of this form are generally used to specify tuples with set-valued attributes.
<!ELEMENT account-info (account-number, sub-account-number*)> <!--
Specification for account identification information -->

(abc)* : This type of DTD is used to represent a set (or a list) of ordered tuples.

<!ELEMENT days-and-hours (date, time)+> <!-- provide times/dates when job fairs
will be held -->

(a|b|c)* : The DTD of the form (a|b|c)* is frequently used to represent a multiset containing
the elements a, b and c. This DTD is very useful since the elements in the multiset are
allowed to appear multiple times and in any order in the document. For example, the
following DTD specifies that the support information for an ad can consist of an
arbitrary number of audio or video clips, photos, and further these can appear in any
order.

<!ELEMENT support-info (audio-clip | file-id | graphic | logo | new-list | photo |
video-clip | zz-generic-tag)*> <!-- support information for ad content -->

((ab)* c)* : This type of DTD permits nesting relationships among sets (or lists).
<!ELEMENT transfer-info (transfer-number, (from-to, company-id)+, contact-info)*>
<!-- provides parent information through the multilevel aggregation process, may be
repeated -->
Table 1 depicts pseudo code for a preferred generalization algorithm (Procedure
GENERALIZE). Procedure GENERALIZE infers several DTDs for each input sequence
independently and adds them to the set S_G. The generalize algorithm may over-generalize in
some cases (since DTDs are inferred based on a single sequence). However, the selection step
in selecting module 130 will ensure that such overly-general DTDs are not chosen as part of
the final inferred DTD, if there are better alternatives. The generalization step will provide
several alternate candidates in addition to the input sequences for the selection step.

The algorithm can infer regular expressions that are more complex than the above;
however, complex expressions, such as (ab?c*d?)*, that are less likely to occur in practice,
may be excluded.

procedure GENERALIZE(I)
begin
1. for each sequence s in I
2.     add s to S_G
3.     for r := 2, 3, 4
4.         s' := DISCOVERSEQPATTERN(s, r)
5.         for d := 0.1·|s'|, 0.5·|s'|, |s'|
6.             s'' := DISCOVERORPATTERN(s', d)
7.             add s'' to S_G
end

procedure DISCOVERSEQPATTERN(s, r)
begin
1. repeat
2.     let x be a subsequence of s with the maximum number (>= r) of contiguous repetitions in s
3.     replace all (>= r) contiguous occurrences of x in s with a new auxiliary symbol Ai = (x)*
4. until (s no longer contains >= r contiguous occurrences of any subsequence x)
5. return s
end

procedure DISCOVERORPATTERN(s, d)
begin
1. s1, s2, ..., sn := PARTITION(s, d)
2. for each subsequence sj in s1, s2, ..., sn
3.     let the set of distinct symbols in sj be a1, a2, ..., am
4.     if (m > 1)
5.         replace subsequence sj in sequence s by a new auxiliary symbol Ai = (a1|a2|...|am)*
6. return s
end

procedure PARTITION(s, d)
begin
1. i := start := end := 1
2. si := s[start, end]
3. while (end < |s|)
4.     while (end < |s| and a symbol in si occurs to the right of si within a distance d)
5.         end := end + 1; si := s[start, end]
6.     if (end < |s|)
7.         i := i + 1; start := end + 1; end := end + 1; si := s[start, end]
8. return s1, s2, ..., si
end

Table 1: Generalization Algorithm
The core of procedure GENERALIZE consists of the procedures
DISCOVERSEQPATTERN and DISCOVERORPATTERN, which are repeatedly called with
predefined parameter values.



Discovering Sequencing Patterns (Procedure DISCOVERSEQPATTERN) 

Procedure DISCOVERSEQPATTERN, shown in Table 1, takes an input sequence s
and returns a candidate DTD that is derived from s by replacing sequencing patterns of the
form xx...x, for a subsequence x in s, with the regular expression (x)*. In addition to s, the
procedure also accepts as input a threshold parameter r > 1 which is the minimum number of
contiguous repetitions of subsequence x in s required for the repetitions to be replaced with
(x)*. In case there are multiple subsequences x with the maximum number of repetitions in
step 2 of procedure DISCOVERSEQPATTERN, the longest among them is chosen, and
subsequent ties are resolved arbitrarily.

Note that instead of introducing the regular expression term (x)* into the sequence s,
an auxiliary symbol that serves as a representative for the term is introduced. The use of
auxiliary symbols enables the description of the algorithms to remain simple and clean since
the input to them is always a sequence of symbols. In a preferred embodiment, there is a
one-to-one correspondence between auxiliary symbols and regular expression terms in the
present invention; thus, if the auxiliary symbol A denotes (bc)* in one candidate DTD, then
it represents (bc)* in every other candidate DTD. Also, procedure
DISCOVERSEQPATTERN may perform several iterations and thus new sequencing patterns
may contain auxiliary symbols corresponding to patterns replaced in previous iterations. For
example, invoking procedure DISCOVERSEQPATTERN with the input sequence s =
abababcababc and r = 2 yields the sequence A1cA1c after the first iteration, where A1 is an
auxiliary symbol for the term (ab)*. After the second iteration, the procedure returns the
candidate DTD A2, where A2 is the auxiliary symbol corresponding to ((ab)* c)*. Thus, the
resulting candidate DTD returned by procedure DISCOVERSEQPATTERN can contain *s
nested within other *s. Finally, DISCOVERSEQPATTERN is invoked with three different
values for the parameter r to control the aggressiveness of the generalization. For example,
for the sequence aabbb, DISCOVERSEQPATTERN with r = 2 would infer a* b*, while with
r = 3, it would infer aab*. In the selection step, if many other sequences are encompassed by
aab*, then a DTD of aab* may be preferred to a DTD of a* b* since it more accurately
describes the input sequences.
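
By way of illustration, procedure DISCOVERSEQPATTERN might be sketched in Python as follows. The sketch makes some assumptions that go beyond the description above: a sequence is held as a list of symbols, each auxiliary symbol is represented directly by the text of its regular expression term (which gives the one-to-one correspondence for free), and ties beyond "most repetitions, then longest" are broken by taking the first candidate found. The helper names count_reps and star are hypothetical.

```python
def count_reps(s, i, x):
    """Number of back-to-back copies of x in s starting at index i."""
    k = 0
    while s[i + k * len(x): i + (k + 1) * len(x)] == x:
        k += 1
    return k


def star(x):
    """Regular expression term for zero or more repetitions of subsequence x."""
    body = "".join(x)
    return body + "*" if len(body) == 1 else "(" + body + ")*"


def discover_seq_pattern(seq, r):
    """Replace runs of >= r contiguous repetitions by starred terms until none remain."""
    s = list(seq)
    while True:
        best_key, best_x = (0, 0), None
        for start in range(len(s)):
            for length in range(1, (len(s) - start) // r + 1):
                x = s[start:start + length]
                key = (count_reps(s, start, x), length)   # most repetitions, then longest
                if key[0] >= r and key > best_key:
                    best_key, best_x = key, x
        if best_x is None:
            return s                                      # no pattern left to replace
        out, i = [], 0
        while i < len(s):
            reps = count_reps(s, i, best_x)
            if reps >= r:
                out.append(star(best_x))                  # auxiliary symbol for (x)*
                i += reps * len(best_x)
            else:
                out.append(s[i])
                i += 1
        s = out


if __name__ == "__main__":
    print("".join(discover_seq_pattern("abababcababc", 2)))
    print("".join(discover_seq_pattern("aabbb", 2)))
    print("".join(discover_seq_pattern("aabbb", 3)))
```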



Discovering OR Patterns (Procedure DISCOVERORPATTERN) 



Procedure DISCOVERORPATTERN, shown in Table 1, infers patterns of the form 
(ai|a2| ... |am)* based on the locality of these symbols within a sequence s. The locality is 
identified by first partitioning (performed by procedure PARTITION, shown in Table 1) the 
input sequence s into the smallest possible subsequences si, 82, Sn, such that for any 
occurrence of a symbol a in a subsequence Si, there does not exist another occurrence of a in 
some other subsequence sj within a distance d (which is a parameter to 
DISCOVERORPATTERN). Each subsequence Si in s is then replaced by the pattern (ai|a2| ... 
|am)* where ai, ... , am are the distinct symbols in the subsequence Si. If Si contains frequent 
repetitions of the symbols ai|a2|...|am in close proximity, then it is very likely that Sj originated 
from a regular expression of the form (ai|a2|...|am)*. For illustrative purposes, for the input 
sequence abcbac, procedure DISCOVERORPATTERN rettims: 

• aAiac for d = 2, where Ai = (b | c)* ; 

• aA2 for d = 3, where A2 = (a ] b | c)* ; and 

• A2 for d = 4, where A2 = (a | b | c)* . 

A preferred component for discovering OR patterns is procedure PARTITION, shown 
in Table 1. For a sequence s, s[i, j] denotes the subsequence of s starting at the i-th symbol and 
ending at the j-th symbol of s. Procedure PARTITION constructs the subsequences in the 
order s1, s2, and so on. Assuming that s1 through sj have been generated, it constructs sj+1 by 
starting sj+1 immediately after sj ends and expanding the subsequence sj+1 to the right as long 
as required to ensure that there is no symbol in sj+1 that occurs within a distance d to the right 
of sj+1. By construction, there cannot exist such a symbol to the left of sj+1. 
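
As an illustration, the partitioning rule can be sketched in Python. This is a minimal sketch and not the Table 1 pseudocode: the function names partition and discover_or_pattern are hypothetical, symbols are assumed to be single characters, "within a distance d" is read as a position difference of at most d, and subsequences of length one are assumed to be left unchanged. Under these assumptions the sketch reproduces the three results listed for the sequence abcbac.

```python
def partition(s, d):
    """Split s into the smallest subsequences such that no symbol of one
    subsequence recurs in another subsequence within distance d."""
    parts, start, n = [], 0, len(s)
    while start < n:
        end = start  # inclusive right edge of the current subsequence
        i = start
        while i <= end:
            # Extend to the farthest occurrence of s[i] at most d positions away.
            for j in range(min(n - 1, i + d), end, -1):
                if s[j] == s[i]:
                    end = j
                    break
            i += 1
        parts.append(s[start:end + 1])
        start = end + 1
    return parts


def discover_or_pattern(s, d):
    """Replace each multi-symbol subsequence by (a1|...|am)* over its distinct symbols."""
    out = []
    for part in partition(s, d):
        if len(part) == 1:  # assumption: single symbols are kept as-is
            out.append(part)
        else:
            out.append("(" + "|".join(sorted(set(part))) + ")*")
    return "".join(out)
```

For example, discover_or_pattern("abcbac", 2) yields a(b|c)*ac, which corresponds to aA1ac with A1 = (b|c)*.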

Note that procedure GENERALIZE invokes DISCOVERORPATTERN on the DTDs 
that result from calls to DISCOVERSEQPATTERN and therefore it is possible to infer more 
complex DTDs of the form (a|(bc)*)* in addition to DTDs like (a|b|c)*. For instance, for the 
input sequence s = abcbca, procedure DISCOVERSEQPATTERN invoked with r = 2 would 
return s' = aA1a, where A1 = (bc)*, which, when input to DISCOVERORPATTERN, returns 
s'' = A2 for d = |s'|, where A2 = (a|A1)*. Further, DISCOVERORPATTERN is invoked with 
various values of d (expressed as a fraction of the length of the input sequence) to control the 
degree of generalization. Small values of d lead to conservative generalizations while larger 
values result in more liberal generalizations. The size of d is based on desired design 
characteristics. 

II. FACTORING 

In a preferred embodiment, the factoring module 120 uses a factoring step to derive 
factored forms for expressions that are an OR of a subset of the candidate DTDs, SG, out of 
the generalization module 110. For example, for candidate DTDs ac, ad, bc and bd in SG, the 
factoring step would generate the factored form (a|b)(c|d). Note that since the final DTD is 
an OR of candidate DTDs, SF, out of the factoring module 120, the factored forms are also 
candidates. Further, a factored candidate DTD, because of its smaller size, has a lower 
minimum description length (MDL) cost, and is thus more likely to be chosen in the selection 
step, if MDL principles are used. Thus, since factored forms (due to their compactness) are 
more desirable, factoring can result in better quality DTDs. 

Factored DTDs are common in real life. For example, in the sample DTD, an article 
may be categorized based on whether it appeared in a workshop, conference or journal; it may 
also be classified according to its area as belonging to either computer science, physics, 
chemistry etc. Thus, the DTD (in factored form) for the element article would be as follows: 

<!ELEMENT article(title, author*, (workshop | conference | journal), 
(computer science | physics | chemistry | ...))> 

The set of candidate DTDs, SF, output by the factoring module 120, in addition to 
the factored forms generated from candidates in SG, also contains all the DTDs in SG. Ideally, 
factored forms for every subset of SG should be added to SF to be considered by the selection 
step. However, this may be impractical, since SG could be very large. Therefore, a heuristic 
may be used to select subsets of candidates in SG that, when factored, yield good factored 
DTDs. In a preferred embodiment, the factoring algorithm is an adaptation of factoring 
algorithms for boolean expressions which are well known in the art. 

Selecting Subsets of SG to Factor 

Intuitively, a subset S of SG out of generalization module 110 is a good candidate for 
factoring if the factored form of S is much smaller than S itself. In addition, even though SG 
may contain multiple generalizations that are derived from the same input sequence, it is 
highly unlikely that the final DTD will contain two generalizations of the same input 
sequence. Thus, factoring candidate DTDs in SG that encompass similar sets of input 
sequences does not lead to factors that can improve the quality of the final DTD. 

For a subset S of SG to yield good factored forms it must satisfy the following two 
properties: 

(1.) Every DTD in S has a common prefix or suffix with a number of 
other DTDs in S. Further, as more DTDs in S share common prefixes or suffixes, 
or as the length of the common prefixes/suffixes increases, the quality of the 
generated factored form can be expected to improve. 

(2.) The overlap between every pair of DTDs D, D' in S is minimal, that 
is, the intersection of the input sequences encompassed by D and D' is small. This 
is important because, as mentioned above, a factored DTD adds little value (from 
an MDL cost perspective) over the candidate DTDs from which it was derived if 
it cannot be used to encode a significantly larger number of input sequences 
compared to the sequences encompassed by each individual DTD. 

In order to state properties (1) and (2) for a set S of DTDs more formally, the 
following notation is used. For a DTD D, let cover(D) denote the input sequences in I that are 
encompassed by D (note that auxiliary symbols are expanded completely when cover for a 
DTD is computed). Then, overlap(D, D') is defined as the fraction of the input sequences 
encompassed by D and D' that are common to D and D', that is, 

$$overlap(D, D') = \frac{|cover(D) \cap cover(D')|}{|cover(D) \cup cover(D')|} \quad (1)$$

Thus, for a sufficiently small value of a (user-specified) parameter δ, by ensuring that 
overlap(D, D') < δ for every pair of DTDs D and D' in S, it can be ensured that S satisfies 
property (2) mentioned above. 

In order to characterize property (1) more rigorously, the function score(D, S) is 
introduced in equation 2. Function score(D, S) attempts to capture the degree of similarity 
between prefixes/suffixes of DTD D and those of DTDs in the set S of DTDs. Intuitively, a 
DTD with a high score with respect to set S is a good candidate to be factored with other 
DTDs in set S. For a DTD D, let pref(D) and suf(D) denote the set of prefixes and suffixes 
of D, respectively. Let psup(p, S) denote the support of prefix p in set S of DTDs, that is, the 
number of DTDs in S for which p is a prefix. Similarly, let ssup(s, S) denote the number of DTDs 
in S for which s is a suffix. Then score(D, S) is defined as follows: 

$$score(D, S) = \max\left(\{|p| \cdot psup(p, S) : p \in pref(D)\} \cup \{|s| \cdot ssup(s, S) : s \in suf(D)\}\right) \quad (2)$$

Thus, the prefix or suffix of D for which the product of its length and its support 
in S is maximum determines the score of D with respect to S. If DTD D has a long prefix or 
suffix that occurs frequently in set S, then this prefix or suffix can be factored out, thus resulting in 
good factored forms. The function score is thus a good measure of how well D would factor 
with other DTDs in S. 
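
The two measures can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names are hypothetical, each DTD is represented as a plain string of single-character symbols, cover sets are passed in as Python sets of sequence identifiers, and the overlap of two empty covers is taken to be 0.

```python
def overlap(cover_d1, cover_d2):
    """Fraction of the sequences covered by either DTD that both DTDs cover."""
    union = cover_d1 | cover_d2
    if not union:
        return 0.0  # assumption: DTDs that cover nothing do not overlap
    return len(cover_d1 & cover_d2) / len(union)


def score(d, dtds):
    """Largest (length x support) over all prefixes and suffixes of d,
    where support counts the DTDs in dtds sharing that prefix or suffix."""
    best = 0
    for n in range(1, len(d) + 1):
        prefix, suffix = d[:n], d[-n:]
        psup = sum(1 for other in dtds if other.startswith(prefix))
        ssup = sum(1 for other in dtds if other.endswith(suffix))
        best = max(best, n * psup, n * ssup)
    return best
```

For the hypothetical set abc, abd, xbd, the score of abc is 4, because the prefix ab has length 2 and is shared by two DTDs in the set.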

Procedure FACTORSUBSETS, shown in Table 2, first selects subsets S of candidate DTDs 
from SG that satisfy properties (1) and (2). Each of these subsets S is then 
factored by invoking procedure FACTOR (in Step 15), depicted in Table 3. Assuming that the 
factoring algorithm returns F1 | F2 | ... | Fm, each of the Fi is added to SF. 

procedure FACTORSUBSETS(SG) 
begin 
1.  for each DTD D in SG 
2.      Compute score(D, SG) 
3.  SF := S' := SG; SeedSet := ∅ 
4.  for i := 1 to k 
5.      let D be the DTD in S' with the maximum value for score(D, SG) 
6.      SeedSet := SeedSet ∪ {D} 
7.      S' := S' - {D' : overlap(D, D') ≥ δ} 
8.  for each DTD D in SeedSet 
9.      S := {D} 
10.     S' := SG - {D' : overlap(D, D') ≥ δ} 
11.     while (S' is not empty) 
12.         let D' be the DTD in S' with the maximum value for score(D', S) 
13.         S := S ∪ {D'} 
14.         S' := S' - {D'' : overlap(D', D'') ≥ δ} 
15.     F := FACTOR(S) 
16.     SF := SF ∪ {F1, ..., Fm}   /* F = F1 | ... | Fm */ 
end 

Table 2: Choosing Subsets Of SG For Factoring 

Procedure FACTORSUBSETS computes a set S of candidate DTDs to factor. First, k 
seed DTDs for the sets S to be factored are chosen in the for loop spanning steps 4-7. These 
seed DTDs have a high score value with respect to SG and overlap minimally with each other. 
Thus, it is ensured that each seed DTD not only factors well with other DTDs in SG, but is 
also significantly different from other seeds. In steps 9-14, each seed DTD is used to construct 
a new set S of DTDs to be factored (thus, only k sets of DTDs are generated). After 
initializing S to a seed DTD D, in each subsequent iteration, the next DTD D' that is added to 
S is chosen greedily (i.e., the one whose score with respect to DTDs in S is maximum and 
whose overlap with DTDs already in S is less than δ). 
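
The greedy selection in steps 4-14 of Table 2 can be sketched in Python as follows. This is an illustrative sketch, not the procedure itself: the name choose_subsets is hypothetical, the scoring and overlap measures are supplied by the caller as the assumed callables score(d, dtds) and overlap(d1, d2), and ties are broken by sorted order, which the description does not specify.

```python
def choose_subsets(candidates, k, delta, score, overlap):
    """Return up to k subsets of candidates, each grown greedily around a seed."""
    candidates = sorted(candidates)  # fixed order makes tie-breaking deterministic

    # Steps 4-7: pick up to k high-scoring seeds that overlap little with each other.
    pool, seeds = list(candidates), []
    for _ in range(k):
        if not pool:
            break
        seed = max(pool, key=lambda d: score(d, candidates))
        seeds.append(seed)
        pool = [d for d in pool if d != seed and overlap(seed, d) < delta]

    # Steps 8-14: grow one subset around each seed.
    subsets = []
    for seed in seeds:
        subset = [seed]
        pool = [d for d in candidates if d != seed and overlap(seed, d) < delta]
        while pool:
            best = max(pool, key=lambda d: score(d, subset))
            subset.append(best)
            pool = [d for d in pool if d != best and overlap(best, d) < delta]
        subsets.append(subset)
    return subsets
```

Each returned subset would then be handed to the factoring procedure, and the chosen DTD is removed from the pool explicitly so that the loop terminates even if a DTD's cover is empty.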



Algorithm For Factoring a Set of DTDs 

Algorithms for computing the optimum factored form, that is, the one with the 
minimum number of literals, are known in the art. However, the computational complexity 
of these known techniques may make them impractical. In a preferred embodiment, heuristic 
factoring algorithms for boolean functions, which are known in the art, are adapted for use in 
the present invention. Factored forms of boolean functions are commonly used in VLSI design. 

There is a close correspondence between the semantics of DTDs and those of boolean 
expressions. The sequencing operator (,) in DTDs is similar to a logical AND in boolean 
algebra, while the OR operator (|) is like a logical OR. However, there exist certain 
fundamental differences between DTDs and boolean expressions. First, while the logical 
AND operator in boolean logic is commutative, the sequencing operator in DTDs is not (the 
ordering of symbols in a sequence matters). Second, in boolean logic, the expression a | ab 
is equivalent to a; however, the equivalent DTD for a | ab is ab?. The boolean algorithms can 
be modified to create a factoring algorithm that handles the semantics of DTDs. The 
pseudo-code for procedure FACTOR is shown in Table 3. Procedure FACTOR is a 
preferred embodiment of the factoring algorithm used in factoring module 120. 

procedure FACTOR(S)   /* S is the set of sequences to be factored */ 
begin 
1.  DivisorSet := FINDALLDIVISORS(S) 
2.  if (DivisorSet = ∅) 
3.      return OR of sequences in S 
4.  DivisorList := ∅ 
5.  for each divisor V in DivisorSet 
6.      Q, R := DIVIDE(S, V) 
7.      add (V, Q, R) to DivisorList 
8.  find the most compact triplet (Vi, Qi, Ri) in DivisorList 
9.  return (FACTOR(Vi))(FACTOR(Qi)) | FACTOR(Ri) 
end 

procedure FINDALLDIVISORS(S) 
begin 
1.  DivisorSet := ∅ 
2.  for each distinct sequence s such that s is a suffix for at least two elements in S 
3.      DivisorSet := DivisorSet ∪ {{p : ps ∈ S}} 
4.  return DivisorSet 
end 

procedure DIVIDE(S, V) 
begin 
1.  for each sequence p in V 
2.      qp := {s : ps ∈ S} 
3.  Q := ∩ (p ∈ V) qp 
4.  R := S - V ∘ Q 
    /* V ∘ Q is the set of sequences resulting from concatenating 
       every sequence in Q to the end of every sequence in V */ 
5.  return Q, R 
end 

Table 3: Factoring Algorithm 

As an example of the factoring algorithm, consider the set S = {b, c, ab, ac, df, dg, ef, 
eg} of input sequences corresponding to the expression b|c|ab|ac|df|dg|ef|eg, whose factored 
form is a?(b|c)|(d|e)(f|g). Before the steps that procedure FACTOR performs to derive the 
factored form are discussed, the DIVIDE operation that constitutes the core of the factoring 
algorithm is introduced. For sets of sequences S and V, DIVIDE(S, V) returns a quotient Q and 
a remainder R such that S = V ∘ Q ∪ R (here, V ∘ Q is the set of sequences resulting from 
concatenating every sequence in Q to the end of every sequence in V). Thus, for the above 
set S and V = {d, e}, DIVIDE(S, V) returns the quotient Q = {f, g} and remainder R = 
{b, c, ab, ac}. The steps executed by FACTOR to generate the factored form are as follows: 

(1.) Compute set of potential divisors for S. These are simply sets of 
prefixes that have a common suffix in S. Thus, potential divisors for S include {d, 
e} (both f and g are common suffixes) and {1, a} (both b and c are common 
suffixes). The symbol "1" is special and denotes the identity symbol with respect 
to the sequencing operator, that is, 1s = s1 = s for every sequence s. 

(2.) Choose divisor V from set of potential divisors. This is carried out by 
first dividing S by each potential divisor V to obtain a quotient Q and remainder 
R, and then selecting the V for which the triplet (V, Q, R) has the smallest size. In 
our case, V = {d, e} results in a smaller quotient and remainder (Q = {f, g}, R = 
{b, c, ab, ac}) than {1, a} (Q = {b, c}, R = {df, dg, ef, eg}) and is thus chosen. 

(3.) Recursively factor V, Q, and R. The final factored form is 
FACTOR(V)FACTOR(Q)|FACTOR(R), where V = {d, e}, Q = {f, g} and R = 
{b, c, ab, ac}. Here, V and Q cannot be factored further since they have no 
divisors. Thus, FACTOR(V) = (d|e) and FACTOR(Q) = (f|g). However, R can 
be factored more since {1, a} is a divisor. Thus, repeating the above steps on R, 
we obtain FACTOR(R) = (1|a)(b|c). Thus, the final factored form is 
(1|a)(b|c)|(d|e)(f|g). 

(4.) Simplify final expression by eliminating "1". The term (1|a) in the 
final expression can be further simplified to a?. Thus, we obtain the desired 
factored form for S. 
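
The four steps can be sketched in Python. This is a minimal sketch under stated assumptions and not the patented implementation: all function names are hypothetical, each symbol is a single character, the empty string plays the role of the identity symbol "1", and the "size" of a triplet (V, Q, R) is assumed to be the total number of symbols it contains, with ties broken by sorted order.

```python
def find_all_divisors(S):
    """Sets of prefixes p that share a common non-empty suffix s (ps in S)."""
    divisors = set()
    for s in {x[i:] for x in S for i in range(len(x))}:
        prefixes = frozenset(x[:len(x) - len(s)] for x in S if x.endswith(s))
        if len(prefixes) >= 2:  # s is a suffix of at least two elements
            divisors.add(prefixes)
    return divisors


def divide(S, V):
    """Return quotient Q and remainder R such that S = (V concatenated with Q) union R."""
    quotients = [{x[len(p):] for x in S if x.startswith(p)} for p in V]
    Q = frozenset(set.intersection(*quotients))
    VQ = {p + q for p in V for q in Q}
    return Q, frozenset(set(S) - VQ)


def or_of(S):
    """OR of the sequences in S, rewriting the identity '' as the ? operator."""
    items = sorted(x for x in S if x)
    if not items:
        return ""
    body = "|".join(items)
    if "" not in S:
        return body
    if len(items) == 1 and len(items[0]) == 1:
        return items[0] + "?"
    return "(" + body + ")?"


def group(expr):
    """Parenthesize expr when it contains an OR outside any parentheses."""
    depth = 0
    for ch in expr:
        depth += (ch == "(") - (ch == ")")
        if ch == "|" and depth == 0:
            return "(" + expr + ")"
    return expr


def factor(S):
    S = frozenset(S)
    divisors = find_all_divisors(S)
    if not divisors:
        return or_of(S)
    size = lambda X: sum(len(x) for x in X)  # assumed compactness measure
    triplets = [(V,) + divide(S, V) for V in divisors]
    V, Q, R = min(triplets, key=lambda t: (sum(map(size, t)), sorted(t[0])))
    expr = group(factor(V)) + group(factor(Q))
    return expr + "|" + factor(R) if R else expr
```

Applied to the example set S, the sketch returns (d|e)(f|g)|a?(b|c), which is the factored form derived above with its two alternatives in the opposite order.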



III. SELECTING 

The step of selecting comprises selecting a DTD. In a preferred embodiment, the 
DTD comprises one or more sequences from the input sequences, generalized sequences, and 
factored sequences. Alternatively, the DTD may be selected from the input sequences and 
generalized sequences if a factoring step is not used. In a preferred embodiment the step of 
selecting is implemented using Minimum Description Length (MDL) principles. 

The MDL cost of a DTD with respect to a set of sequences comprises: 

(A) the length, in bits, needed to describe the DTD, and 

(B) the length of the sequences, in bits, when encoded in terms of the DTD. 

First, the number of bits required to describe the DTD is estimated (part (A) of the 
MDL cost). Let Σ be the set of subelement symbols that appear in sequences in I. Let M be 
the set of metacharacters |, *, +, ?, (, ). Let the length of a DTD, viewed as a string in Σ ∪ M, 
be n. Then, the length of the DTD in bits is n log(|Σ| + |M|). As an example, let Σ consist of 
the elements a and b. The length in bits of the DTD a*b* is 4 * log(2 + 6) = 12. Similarly, 
the length in bits of the DTD (ab|abb)(aa|ab*) is 16 * 3 = 48. 
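
The arithmetic for part (A) can be checked with a few lines of Python. The function name dtd_length_bits is hypothetical, the logarithm is taken to base 2 (which matches the two worked values), and whitespace inside the DTD string is ignored as an assumption of this sketch.

```python
import math

METACHARACTERS = "|*+?()"  # the six metacharacters in M


def dtd_length_bits(dtd, alphabet):
    """n * log2(|alphabet| + |M|), where n counts symbols and metacharacters."""
    n = sum(1 for ch in dtd if not ch.isspace())
    return n * math.log2(len(set(alphabet)) + len(METACHARACTERS))
```

With the alphabet {a, b}, each character of the DTD costs log2(8) = 3 bits, so a*b* costs 12 bits and (ab|abb)(aa|ab*) costs 48 bits.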

The Encoding Scheme comprises the following steps: 

(A) seq(D, s) = ε if D = s. In this case, DTD D is a sequence of symbols from the 
alphabet Σ and does not contain any metacharacters. 

(B) seq(D1...Dk, s1...sk) = seq(D1, s1)...seq(Dk, sk); that is, D is the concatenation of 
regular expressions D1...Dk, and the sequence s can be written as the 
concatenation of the subsequences s1...sk, such that each subsequence si 
matches the corresponding regular expression Di. 

(C) seq(D1 | ... | Dm, s) = i seq(Di, s); that is, D is the exclusive choice of regular 
expressions D1...Dm, and i is the index of the regular expression that the 
sequence s matches. Note that we need ⌈log m⌉ bits to encode the index i. 

(D) seq(D*, s1...sk) = k seq(D, s1)...seq(D, sk) if k > 0, and 0 otherwise. 

In other words, the sequence s = s1...sk is produced from D* by instantiating 
the repetition operator k times, and each subsequence si matches the i-th instantiation. 
In this case, since there is no simple and inexpensive way to bound a priori the 
number of bits required for the index k, we first specify the number of bits required to 
encode k in unary (that is, a sequence of ⌈log k⌉ 1s, followed by a 0) and then the index 
k using ⌈log k⌉ bits. The 0 in the middle serves as the delimiter between the unary 
encoding of the length of the index and the actual index itself. 

Table 4: Encoding Scheme 

The MDL subsystem is responsible for choosing a set S of candidate DTDs from SF 
such that the final DTD D (which is a logical OR of the DTDs in S) (1) encompasses all 
sequences in I, and (2) has the minimum MDL cost. 

Next, the scheme for encoding a sequence using a DTD (part (B) of the MDL cost) is 
determined. The encoding scheme constructs a sequence of integral indices (which forms the 
encoding) for expressing a sequence in terms of a DTD. The following simple examples 
illustrate the basic building blocks on which the encoding scheme for more complex DTDs is 
built: 

(1.) The encoding for the sequence a in terms of the DTD a is the empty string ε. 
(2.) The encoding for the sequence b in terms of the DTD a | b | c is the integral index 
1 (denotes that b is at position 1, counting from 0, in the above DTD). 
(3.) The encoding for the sequence bbb in terms of the DTD b* is the integral index 3 
(denotes 3 repetitions of b). 

Next, the encoding scheme is generalized to arbitrary DTDs and arbitrary sequences. 
The sequence of integral indices for a sequence s when encoded in terms of a 
DTD D is denoted by seq(D, s). We define seq(D, s) recursively in terms of component DTDs 
within D as shown in Table 4. Thus, seq(D, s) can be computed using a recursive procedure 
based on the encoding scheme depicted in Table 4. Note that the definitions of 
the encodings for operators + and ? have not been provided since these can be defined in a 
similar fashion to * (for +, k is always greater than 0, while for ?, k can only assume values 1 
or 0). 

Next, the encoding scheme is illustrated using the following example. Consider the 
DTD (ab|c)*(de|fg*) and the sequence abccabfggg to be encoded in terms of the DTD. 
Below, we list how steps (A), (B), (C) and (D) in Table 4 are recursively applied to derive the 
encoding seq((ab|c)*(de|fg*), abccabfggg). 

1. Apply Step (B). seq((ab|c)*, abccab) seq((de|fg*), fggg) 

2. Apply Step (D). 4 seq(ab|c, ab) seq(ab|c, c) seq(ab|c, c) seq(ab|c, ab) seq((de|fg*), fggg) 

3. Apply Step (C). 4 0 seq(ab, ab) 1 seq(c, c) 1 seq(c, c) 0 seq(ab, ab) 1 seq(fg*, fggg) 

4. Apply Step (A). 4 0 1 1 0 1 seq(fg*, fggg) 

5. Apply Steps (A), (B) and (D). 4 0 1 1 0 1 3 

In order to derive the final bit sequence corresponding to the above indices, the unary 
representation for the number of bits required to encode the indices 4 and 3 is included in the 
encoding. Thus, the following bit encoding for the sequence is obtained: 

seq((ab|c)*(de|fg*), abccabfggg) = 1110100 0 1 1 0 1 11011 
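
The conversion from indices to bits can be reproduced with a short Python sketch. The function names are hypothetical. One assumption is made explicit: the worked example spends 3 bits on the index 4 and 2 bits on the index 3, which is the length of the binary representation of k, so the sketch uses that length for the repetition count rather than evaluating ⌈log k⌉ literally.

```python
import math


def choice_bits(index, m):
    """Index of the matching alternative among m choices, in ceil(log2 m) bits."""
    width = math.ceil(math.log2(m)) if m > 1 else 0
    return format(index, "b").zfill(width) if width else ""


def repeat_bits(k):
    """Repetition count k: width in unary, a 0 delimiter, then k in binary."""
    width = k.bit_length()  # assumption: width of k's binary representation
    binary = format(k, "b") if k else ""
    return "1" * width + "0" + binary


def encode(indices):
    """indices: list of ('rep', k) or ('or', i, m) entries, in seq(D, s) order."""
    out = []
    for entry in indices:
        if entry[0] == "rep":
            out.append(repeat_bits(entry[1]))
        else:
            out.append(choice_bits(entry[1], entry[2]))
    return " ".join(out)
```

Encoding the index sequence 4 0 1 1 0 1 3 of the example, with each OR having two alternatives, gives 1110100 0 1 1 0 1 11011.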

In steps (B), (C) and (D), of the encoding scheme it needs to be determined if a 
sequence s matches a DTD D. Since a DTD is a regular expression, known techniques for 
finding out if a sequence is encompassed by a regular expression can be used. These known 
methods involve constructing a non-deterministic finite automaton for D and can also be used 
to decompose the sequence s into subsequences such that each subsequence matches the 
corresponding sub-part of the DTD D, thus enabling the encoding to be determined. 

Note that there may be multiple ways of partitioning the sequence s such that each 
subsequence matches the corresponding sub-part of the DTD D. In such a case, the above 
procedure can be extended to enumerate every decomposition of s that matches the sub-parts 
of D, and then select, from among the decompositions, the one that results in the minimum 
length encoding of s in terms of D. 

Computing the DTD with Minimum MDL Cost 

Next, the final DTD D (which is a logical OR of a subset S of candidate DTDs in SF) 
that encompasses all the input sequences and whose MDL cost for encoding the input 
sequences is minimum is computed. The minimization problem maps naturally to the Facility 
Location Problem (FLP). The Facility Location Problem is well known in the art. The FLP is 
formulated as follows: Let C be a set of customers and J be a set of facilities such that 
each facility "serves" every customer. There is a cost c(j) of "choosing" a facility j ∈ J and a cost 
d(j, i) of serving customer i ∈ C by facility j ∈ J. The problem definition asks to choose a 
subset of facilities F ⊆ J such that the sum of costs of the chosen facilities plus the sum of 
costs of serving every customer by its closest chosen facility is minimized, that is, 

$$\min_{F \subseteq J} \left\{ \sum_{j \in F} c(j) + \sum_{i \in C} \min_{j \in F} d(j, i) \right\}$$

The problem of inferring the minimum MDL cost DTD can be reduced to the FLP as 
follows: Let C be the set of input sequences and J be the set of candidate DTDs in SF. The cost 
of choosing a facility is the length of the corresponding candidate DTD. The cost of serving 
client i from facility j, d(j, i), is the length of the encoding of the sequence corresponding to i 
using the DTD corresponding to the facility j. If a DTD j does not encompass a sequence i, 
then we set d(j, i) to ∞. Thus, the set F computed by the FLP corresponds to the desired set S 
of candidate DTDs. Algorithms for solving the FLP are well known in the art. In a preferred 
embodiment, a randomized algorithm is employed to approximate the FLP. 
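
To make the reduction concrete, the following Python sketch evaluates the FLP objective by exhaustive search over subsets of candidate DTDs. It is not the randomized approximation mentioned above: exhaustive search is swapped in only because it is easy to verify on tiny inputs, and the function name and dictionary-based inputs are hypothetical.

```python
from itertools import combinations
import math


def cheapest_dtd_set(dtd_cost, encode_cost, sequences):
    """Brute-force FLP: minimise sum of c(j) for chosen j plus, for every
    sequence i, the cheapest d(j, i) among chosen j (missing entries are infinite)."""
    best_cost, best_set = math.inf, None
    dtds = sorted(dtd_cost)
    for size in range(1, len(dtds) + 1):
        for chosen in combinations(dtds, size):
            total = sum(dtd_cost[j] for j in chosen)
            total += sum(
                min(encode_cost[j].get(i, math.inf) for j in chosen)
                for i in sequences
            )
            if total < best_cost:
                best_cost, best_set = total, chosen
    return best_set, best_cost
```

As a hypothetical illustration, take the sequences ab, abab and ababab, each also serving as a candidate DTD (costs 6, 12 and 18 bits, encoding itself in 0 bits), plus the candidate (ab)* with a cost of 15 bits and encoding costs of 3, 5 and 5 bits. The search then selects (ab)* alone, at a total cost of 28 bits.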

Having thus described a few particular embodiments of the invention, various 
alterations, modifications, and improvements will readily occur to those skilled in the art. 
Such alterations, modifications and improvements as are made obvious by this disclosure are 
intended to be part of this description though not expressly stated herein, and are intended to 
be within the spirit and scope of the invention. Accordingly, the foregoing description is by 
way of example only, and not limiting. The invention is limited only as defined in the 
following claims and equivalents thereto. 

EXPERIMENTAL RESULTS 

In order to determine the effectiveness of the present invention for inferring the DTD 
of a database of XML documents, we conducted a study with both synthetic and real-life 
DTDs. We also compared the DTDs produced by a DTD extraction tool (XTRACT) in 
accordance with a preferred embodiment of the present invention with those generated by the 
IBM alphaWorks DTD extraction tool, DDbE (Data Description by Example), for XML data 
(the DDbE tool and a detailed description of it are available at 
http://www.alphaworks.ibm.com/). The results indicate that XTRACT outperforms DDbE 
over a wide range of DTDs, and accurately finds almost every original DTD while DDbE fails 
to do so for most DTDs. Thus, the results clearly demonstrate the effectiveness of 
XTRACT's approach that employs generalization and factorization to derive a range of 
general and concise candidate DTDs, and then uses the MDL principle as the basis to select 
from amongst them. 

The two DTD extraction algorithms considered in the experimental study are as 
follows: 

XTRACT: XTRACT includes all three steps for determining a DTD in 
accordance with the present invention. In the generalization step, we discover 
both sequencing and OR patterns using procedure GENERALIZE. In the 
factoring step, k = ^/\o subsets are chosen for factoring and the parameter q is 
set to 0 in the procedure FACTORSUBSETS. Finally, in the selection step, 
we employ an algorithm which incorporates MDL principles to compute an 
approximation to the facility location problem (FLP). 

DDbE: We used Version 1.0 of the DDbE DTD extraction tool in the 
experiments. DDbE is a Java component library for inferring a DTD from a 
data set consisting of well-formed XML instances. DDbE offers parameters 
which permit the user to control the structure of the content models and the 
types used for attribute declarations. Some of the important parameters of 
DDbE that we used in the experiments, along with their default values, are 
presented in Table 5. 

Parameter  Meaning                                                                Default
---------  ---------------------------------------------------------------------  -------
c          Maximum number of consecutive identical tokens not replaced by a list  1
d          Maximum depth of factorization                                         2

Table 5: Description of Parameters Used by DDbE 



The parameter c specifies the maximum number of consecutive identical tokens that 
should not be replaced by a list. For example, the default value of this parameter is 1 and thus 
all sequences containing two or more repetitions of the same symbol are replaced with a 
positive list. That is, aa is substituted by a+. The parameter d determines the number of 
applications of factoring. For a set of input sequences that conform to the DTD 
a(b|c|d)(e|f|g)h, for increasing values of the parameter d, DDbE returns the DTDs in Table 
6. 



Parameter Value (d)  DTD Obtained
-------------------  --------------------------------------
1                    (acg|ace|adf|abg|abe|acf|adg|ade|abf)h
2                    a(cg|ce|df|bg|be|cf|dg|de|bf)h
3                    a((c|b|d)g|(d|c|b)f|(c|b|d)e)h
4                    a((c|b|d)g|(d|c|b)f|(c|b|d)e)h

Table 6: DTDs generated by DDbE for Increasing Values of Parameter d 

As shown in Table 6, for d = 1, factorization is performed once in which the rightmost 
symbol h is factored out. When the value of d becomes 2, the leftmost symbol a is also 
factored out. A further increase in the value of d to 3 causes factorization to be performed on 
the middle portion of the expression and the common expression (b|c|d) to be extracted. 
However, note that subsequent increases in the value of d (beyond 3) do not result in further 
changes to the DTD. This seems to be a limitation of DDbE's factoring algorithm since, 
examining the DTD for d = 3, we can easily notice that e, f and g have a common factor of 
(b|c|d) with different placement of the symbols within the parentheses. However, the current 
version of DDbE cannot factorize this further. 

In order to evaluate the quality of DTDs retrieved by XTRACT, we used both 
synthetic as well as real-life DTD schemas. For each DTD for a single element, we generated 
an XML file containing 1000 instantiations of the element. These 1000 instantiations were 
generated by randomly sampling from the DTD for the element. Thus, the initial set of input 
sequences I to both XTRACT and DDbE contained between 500 and 1000 
sequences (after the elimination of duplicates) conforming to the original DTD. 

THE DATA 

Synthetic DTD Data Set: We used a synthetic data generator to generate the synthetic 
data sets. Each DTD is randomly chosen to have one of the following two forms: 
A1|A2|A3|...|An and A1A2A3...An. Thus, a DTD has n building blocks, where n is a randomly 
chosen number between 1 and mb, where mb is an input parameter to the generator that 
specifies the maximum number of building blocks in a DTD. Each building block Ai further 
consists of ni symbols, where ni is randomly chosen to be between 1 and ms (the parameter 
ms specifies the maximum number of symbols that can be contained in a building block). 
Each building block Ai has one of the following four forms, each of which has an equal 
probability of occurrence: (1) (a1|a2|a3|...|ani), (2) a1a2a3...ani, (3) (a1|a2|a3|a4|...|ani)*, and 
(4) (a1a2a3a4...ani)*. Here, the ai's denote subelement symbols. Thus, the synthetic data 
generator essentially generates DTDs containing one level of nesting of regular expression 
terms. 
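
A generator of this kind can be sketched in Python as follows. This is a hypothetical reconstruction from the description only: the function name generate_dtd is an assumption, the random source is passed in explicitly, and symbols are assumed to be consecutive lowercase letters that are never reused across building blocks, which limits mb * ms to 26.

```python
import random
import string


def generate_dtd(mb, ms, rng):
    """Build a random DTD with 1..mb building blocks of 1..ms symbols each."""
    if mb * ms > len(string.ascii_lowercase):
        raise ValueError("not enough distinct letters for mb * ms symbols")
    symbols = iter(string.ascii_lowercase)
    blocks = []
    for _ in range(rng.randint(1, mb)):
        syms = [next(symbols) for _ in range(rng.randint(1, ms))]
        form = rng.randint(1, 4)  # the four forms are equally likely
        if form == 1:
            blocks.append("(" + "|".join(syms) + ")")
        elif form == 2:
            blocks.append("".join(syms))
        elif form == 3:
            blocks.append("(" + "|".join(syms) + ")*")
        else:
            blocks.append("(" + "".join(syms) + ")*")
    # Top level is either an OR of the blocks or their concatenation.
    return rng.choice(["|", ""]).join(blocks)
```

For example, generate_dtd(5, 5, random.Random(0)) produces one DTD with at most five building blocks; passing a seeded random.Random instance keeps the output reproducible.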

In Table 7, we show the synthetic DTDs that we considered in the experiments (note 
that, in Table 7, we only include the regular expression corresponding to the DTD). The 
DTDs were produced using the generator with the input parameters mb and ms both set to 5. 
Note that we use letters from the alphabet as subelement symbols. 



No.  Original DTD
---  ----------------------------------
1    abcde|efgh|ij|klm
2    (a|b|c|d|f)*gh
3    (a|b|c|d)*|e
4    (abcde)*f
5    (ab)*|cdef|(ghi)*
6    abcdef(g|h|i|j)(k|l|m|n|o)
7    (a|b|c)d*e*(fgh)*
8    (a|b)(cdefg)*hijklmnopq(r|s)*
9    (abcd)*|(e|f|g)*|h|(ijklm)*
10   a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*

Table 7: Synthetic DTD Data Set 

The ten synthetic DTDs vary in complexity, with later DTDs being more complex than 
the earlier ones. For instance, DTD 1 does not contain any metacharacters, while DTDs 2 
through 5 contain simple sequencing and OR patterns. DTD 6 represents a DTD in factored 
form while in DTDs 7 through 10, factors are combined with sequencing and OR patterns. 

Real-life DTD Data Set: We obtained the real-life DTDs from the Newspaper 
Association of America (NAA) Classified Advertising Standards XML DTD produced by the 
NAA Classified Advertising Standards Task Force (this can be accessed at 
http://www.naa.org/technology/clsstdtf/Adex010.dtd). We examined this real-life DTD data 
and collected six representative DTDs that are shown in Table 8. Of the DTDs shown in the 
table, the last three DTDs are quite interesting. DTD 4 contains the metacharacter ? in 
conjunction with the metacharacter *, while DTDs 5 and 6 contain two regular expressions 
with * 's, one nested within the other. 



No.  Original DTD                                                                      Simplified DTD
---  --------------------------------------------------------------------------------  --------------
1    <!ENTITY % included-elements                                                      a|b|c|d|e
       "audio-clip | blind-box-reply | graphic | linkpi-char | video-clip">
2    <!ELEMENT communications-contacts                                                 (a|b|c|d|e)*
       (phone | fax | email | pager | web-page)*>
3    <!ELEMENT employment-services                                                     ab*c*
       (employment-service.type, employment-service.location*, (e.zz-generic-tag)*)>
4    <!ENTITY % location                                                               a*b?c?d?
       "addr*, geographic-area?, city?, state-province?, postal-code?, country?">
5    <!ELEMENT transfer-info                                                           (a(bc)+d)*
       (transfer-number, (from-to, company-id)+, contact-info)*>
6    <!ELEMENT real-estate-services                                                    (ab?c*d?)*
       (real-estate-service.type, real-estate-service.location?,
       r-e.response-modes*, r-e.comment?)*>

Table 8: Real-life DTD Data Set 

QUALITY OF INFERRED DTDS 

Synthetic DTD Data Set: The DTDs inferred by XTRACT and DDbE for the synthetic 
data set are presented in Table 9. As shown in the table, XTRACT infers each of the original 
DTDs correctly. In contrast, DDbE computes the accurate DTD for only DTD 1, which is the 
simplest DTD containing no metacharacters. Even for the simple DTDs 2-5, not only is 
DDbE unable to correctly deduce the original DTD, but it also infers a DTD that does not 
encompass the set of input sequences. For instance, one of the input sequences encompassed 
by DTD 2 is gh, which is not encompassed by the DTD inferred by DDbE. Thus, while 
XTRACT infers a DTD that encompasses all the input sequences, the DTD returned by 
DDbE may not encompass every input sequence. DTD 4 exemplifies the two typical 
behaviors of DDbE: (1) sequence f that is not frequently repeated is appended to both the 
front and the back of the final DTD, and (2) symbols that are repeated frequently are all ORed 
together and encapsulated by the metacharacter +. For example, DDbE incorrectly identifies 
the term (abcde)* to be (a|b|c|d|e)*, which is much more general. Thus, the DDbE tool has a 
tendency to over-generalize when the original DTDs contain regular expressions with *'s. 
This same trend to over-generalize can be seen in DTDs 8-10 also. On the other hand, as is 
evident from Table 9, this is not the case for XTRACT, which correctly infers every one of the 
original DTDs even for the more complex DTDs 8-10 that contain various combinations of 
sequencing and OR patterns. This clearly demonstrates the effectiveness of the generalization 
module in discovering these patterns and the MDL module in selecting these general 
candidate DTDs as the final DTDs. 

Also, as discussed earlier, DDbE is not very good at factoring DTDs. For instance, 
unlike XTRACT, DDbE is unable to derive the final factored form for DTD 6. Finally, 
DDbE infers an extremely complex DTD for the simple DTD 7. The results for the synthetic 
data set clearly demonstrate the superiority of XTRACT's approach (based on the 
combination of generalizing, factoring, and selecting using MDL principles) compared to 
DDbE's for the problem of inferring DTDs. 

Real-life DTD Data Set: The DTDs generated by the two algorithms for the real-life 
data set are shown in Table 10. Of the five DTDs, XTRACT is able to infer all five correctly. 
In contrast, DDbE is able to derive accurate DTDs only for DTDs 1 and 2, and an 
approximate DTD for DTD 3. Basically, with an additional factoring step, DDbE could 
obtain the original DTD for DTD 3. Note, however, that DDbE is unable to infer the simple 
DTD 4 that contains the metacharacter ?. In contrast, XTRACT is able to deduce this DTD 
because its factorization step takes into account the identity element "1" and simplifies 
expressions of the form 1|a to a?. DTD 5 represents an interesting case where XTRACT is 
able to mine a DTD containing regular expressions with nested *'s. This is due to the 
generalization module that iteratively looks for sequencing patterns. On the other hand, 
DDbE simply over-generalizes DTD 5 by ORing all the symbols in it and enclosing them 
within the metacharacter +. 



No. 1 
    Original DTD:            abcde|efgh|ij|klm 
    DTD Inferred by XTRACT:  abcde|efgh|ij|klm 
    DTD Inferred by DDbE:    abcde|efgh|ij|klm 

No. 2 
    Original DTD:            (a|b|c|d|f)*gh 
    DTD Inferred by XTRACT:  (a|b|c|d|f)*gh 
    DTD Inferred by DDbE:    gh(a|b|c|d|f)+gh 

No. 3 
    Original DTD:            (a|b|c|d)*|e 
    DTD Inferred by XTRACT:  (a|b|c|d)*|e 
    DTD Inferred by DDbE:    (e(a|c|d|b)+e) 

No. 4 
    Original DTD:            (abcde)*f 
    DTD Inferred by XTRACT:  (abcde)*f 
    DTD Inferred by DDbE:    (f(a|e|d|c|b)+f) 

No. 5 
    Original DTD:            (ab)*|cdef|(ghi)* 
    DTD Inferred by XTRACT:  (ab)*|cdef|(ghi)* 
    DTD Inferred by DDbE:    cdef(a|b|g|i|h)+cdef 

No. 6 
    Original DTD:            abcdef(g|h|i|j)(k|l|m|n|o) 
    DTD Inferred by XTRACT:  abcdef(g|h|i|j)(k|l|m|n|o) 
    DTD Inferred by DDbE:    abcdef(j(o|l|m|n|k)|g(o|l|n|m|k)|h(m|l|n|k|o)|i(o|l|n|m|k)) 

No. 7 
    Original DTD:            (a|b|c)d*e*(fgh)* 
    DTD Inferred by XTRACT:  (a|b|c)d*e*(fgh)* 
    DTD Inferred by DDbE:    ((c|b|a)d+e+|ad+|bd+|c(e+|d+)?|ad*|be*))(f|h|g)+((a|b|c)d+e+|c(e+|d+)?|a(e+|d+)?|b(e+|d+)?) 

No. 8 
    Original DTD:            (a|b)(cdefg)*hijklmnopq(r|s)* 
    DTD Inferred by XTRACT:  (a|b)(cdefg)*hijklmnopq(r|s)* 
    DTD Inferred by DDbE:    ((((a|b)hijabedefg)|b|a)(c|g|f|e|d|s|r)+((b|a)?hijkamnopq)) 

No. 9 
    Original DTD:            (abcd)*|(e|f|g)*|h|(ijklm)* 
    DTD Inferred by XTRACT:  (abcd)*|(ijklm)*|h|(e|f|g)* 
    DTD Inferred by DDbE:    h(a|d|c|b|e|g|f|i|m|l|k|j)+h 

No. 10 
    Original DTD:            a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)* 
    DTD Inferred by XTRACT:  a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)* 
    DTD Inferred by DDbE:    (a+|gh)(e|f|d|i|j|l|n|m|k|c|b)+(a+|gh) 

Table 9: DTDs generated by XTRACT and DDbE for Synthetic Data Set 



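No. 1 
    Simplified DTD:          a|b|c|d|e 
    DTD Obtained by XTRACT:  a|b|c|d|e 
    DTD Obtained by DDbE:    a|b|c|d|e 

No. 2 
    Simplified DTD:          (a|b|c|d|e)* 
    DTD Obtained by XTRACT:  (a|b|c|d|e)* 
    DTD Obtained by DDbE:    (a|b|c|d|e)* 

No. 3 
    Simplified DTD:          (ab*c*) 
    DTD Obtained by XTRACT:  ab*c* 
    DTD Obtained by DDbE:    (ab+c*)|(ac*) 

No. 4 
    Simplified DTD:          a*b?c?d? 
    DTD Obtained by XTRACT:  a*b?c?d? 
    DTD Obtained by DDbE:    (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|a+|b) 

No. 5 
    Simplified DTD:          (a(bc)+d)* 
    DTD Obtained by XTRACT:  (a(bc)*d)* 
    DTD Obtained by DDbE:    (a|b|c|d)+ 

Table 10: DTDs generated by XTRACT and DDbE for Real-life Data Set 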



The quality of the DTDs inferred by XTRACT was compared with those returned by 
the IBM alphaWorks DDbE (Data Descriptors by Example) DTD extraction tool on synthetic 
as well as real-life DTDs. In the experiments, XTRACT outperformed DDbE by a wide 
margin, and for most DTDs it was able to accurately infer the DTD while DDbE completely 
failed to do so. A number of the DTDs which were correctly identified by XTRACT were 
fairly complex and contained factors, metacharacters, and nested regular expression terms. 
Thus, the results clearly demonstrate the effectiveness of XTRACT's approach, which employs 
generalization and factorization to derive a range of general and concise candidate DTDs, and 
then uses a selection step, preferably comprising minimum descriptor length (MDL) principles, 
as the basis to select from amongst them. 
What is claimed is: 

1. A document descriptor extraction method comprising the steps of: 
generalizing input sequences associated with a document to develop general 
sequences, said input sequences reflecting the structure of a document; 

factoring said input sequences and said general sequences to develop factored 
sequences; 

selecting a document descriptor from said input sequences, said general sequences, 
and said factored sequences using minimum descriptor length (MDL) principles. 

2. The method of claim 1, wherein said selecting step comprises the steps of: 
encoding said input sequences, said general sequences, and said factored sequences; 
and 

selecting a document descriptor which encompasses all of said input sequences and 
exhibits a minimum MDL cost. 

3. The method of claim 2, wherein said encoding step comprises: 
    seq(D, s) = ε if D = s, if D does not contain metacharacters; 
    seq(D_1...D_k, s_1...s_k) = seq(D_1, s_1)...seq(D_k, s_k); 
    seq(D_1|...|D_m, s) = i seq(D_i, s); 
    seq(D*, s_1...s_k) = {k seq(D, s_1)...seq(D, s_k) if k > 0; 0 otherwise}; 
wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular 
expression that the corresponding sequence s matches, wherein log m bits are needed to 
encode index i. 

4. The method of claim 3, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), said FLP modified to 
compute said minimum MDL cost of potential document descriptors. 

5. The method of claim 4, wherein said document descriptor is a document type 
descriptor (DTD), and said document is an Extensible Markup Language (XML) document. 

6. The method of claim 5, wherein said minimum MDL cost comprises summing a 
first length of bits describing the DTD and a second length of bits for encoding the sequences. 

7. A document descriptor extraction method comprising the steps of: 
generalizing input sequences to develop general sequences, said input sequences 
reflecting the structure of data within a document; 

selecting a document descriptor from said input sequences and said general sequences 
using minimum descriptor length (MDL) principles. 

8. The method of claim 7, wherein said selecting step comprises the steps of: 
encoding said input sequences and said general sequences; and 

selecting a document descriptor which encompasses all of said input sequences and 
exhibits a minimum MDL cost. 

9. The method of claim 8, wherein said encoding step employs an algorithm which 
applies a set of rules comprising: 

    seq(D, s) = ε if D = s, if D does not contain metacharacters; 
    seq(D_1...D_k, s_1...s_k) = seq(D_1, s_1)...seq(D_k, s_k), if D is a concatenation of D_1...D_k; 
    seq(D_1|...|D_m, s) = i seq(D_i, s); 
    seq(D*, s_1...s_k) = {k seq(D, s_1)...seq(D, s_k) if k > 0; 0 otherwise}; 

wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular 
expression that the corresponding sequence s matches, wherein log m bits are needed to 
encode index i. 

10. The method of claim 9, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), wherein said FLP is 
modified to compute said minimum MDL cost of potential document descriptors. 

11. The method of claim 10, wherein said document descriptor is a document type 
descriptor (DTD), and said document is an Extensible Markup Language (XML) document. 

12. The method of claim 11, wherein said minimum MDL cost comprises summing a 
first length of bits describing the DTD and a second length of bits for encoding the sequences. 

13. The method of claim 7, further comprising the step of: 

factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available for said step of selecting. 

14. A computer program for generalizing input sequences to develop general 
sequences comprising: 

a discover OR patterns procedure; 
a discover sequence patterns procedure; and 

a generalize procedure which calls said discover sequence patterns procedure 
and calls said discover OR patterns procedure, wherein said discover OR patterns procedure 
is nested within said discover sequence patterns procedure. 

15. The computer program of claim 14, further comprising a partition procedure called 
by said discover OR patterns procedure. 

16. A document descriptor extraction method of claim 15, utilizing a computer 
program for generalizing input sequences as set forth in claim 15. 

17. The method of claim 16, comprising: 

generalizing said input sequences to create general sequences using said computer 
program; and 

selecting a document descriptor from said input sequences and said general sequences. 

18. The method of claim 17, further comprising: 

factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available to said step of selecting. 

19. The method of claim 18, wherein said step of selecting employs minimum 
descriptor length (MDL) principles. 

20. The method of claim 19, wherein said document descriptor is a document type 
descriptor (DTD) and said document is an Extensible Markup Language (XML) document. 

21. A method for generalizing input sequences to develop general sequences 
comprising the steps of: 

discovering OR patterns among said input sequences; and 

discovering sequence patterns among said input sequences and OR patterns. 

22. The method of claim 21, wherein said step of discovering OR patterns comprises 
the step of partitioning said input sequences. 

23. A document descriptor extraction method, utilizing a method for generalizing 
input sequences as set forth in claim 22. 

24. The method of claim 23, further comprising the steps of: 

generalizing said input sequences to create general sequences using said method for 
generalizing input sequences; and 

selecting a document descriptor from said input sequences and said general sequences. 

25. The method of claim 24, further comprising the steps of: 

factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available to said step of selecting. 

26. The method of claim 25, wherein said step of selecting employs minimum 
descriptor length (MDL) principles. 

27. The method of claim 26, wherein said document descriptor is a document type 
descriptor (DTD) and said document is an Extensible Markup Language (XML) document. 

TITLE: DOCUMENT DESCRIPTOR EXTRACTION METHOD 



ABSTRACT OF THE DISCLOSURE 

The present invention discloses a document descriptor extraction method and system. 
The document descriptor extraction method and system creates a document descriptor by 
generalizing input sequences within a document; factoring the input sequences and 
generalized input sequences; and selecting a document descriptor from the input sequences, 
generalized sequences, and factored sequences, preferably using minimum descriptor length 
(MDL) principles. Novel algorithms are employed to perform the generalizing, factoring, and 
selecting. 




FIELD OF THE INVENTION 

The present invention relates to electronic documents. Specifically, the present 
invention relates to determining document descriptors from data within electronic documents. 

BACKGROUND OF THE INVENTION 

The number of documents available in electronic format has exploded. With the 
number of available electronic documents increasing rapidly, it is important to be able to 
quickly and accurately search the available electronic documents. In addition, it is desirable 
to be able to store data into electronic documents and generate new electronic documents 
which are similar in structure to existing electronic documents. Hence, tools which assist in 
the querying of electronic documents, the creation of electronic documents, and the storage of 
data into electronic documents are desirable. 

Electronic documents for display over the Internet and/or an Intranet are commonly 
stored in a Standard Generalized Markup Language (SGML) format. SGML is a standard for 
how to specify a document markup language or tag set. SGML is not in itself a document 
language, but a description of how to specify one. The SGML format provides for the 
inclusion of a document type descriptor (DTD). A document's DTD specifies how the data 
within a document should be organized. One SGML format for storing data within electronic 
documents which is becoming increasingly popular is Extensible Markup Language (XML). 
XML is rapidly emerging as the new standard for representing and exchanging data on the 
World Wide Web (web). An XML document may be accompanied by a document type 
descriptor (DTD). For example, in an XML document, the DTD may specify the tags which 
can be used, the order in which the tags appear, how the tags are nested, and tag attributes. 
Thus, the DTD plays an important role in the storage of data to the XML document, the 
generation of similar documents, and increasing the efficiency of queries of the XML 
document. Efficiency is achieved by using the knowledge of the structure of the data to 
remove elements that cannot potentially satisfy the query. 

Although DTDs are helpful in the storage, generation, and retrieval of data related to 
an XML document, DTDs are not mandatory. Since DTDs are not mandatory, many XML 
documents exist which do not contain DTDs. In addition, since only a small portion of the 
electronic documents in existence today are in an XML format, initially the majority of XML 
documents will likely be automatically generated from pre-existing non-XML documents. In 
many instances, the automatically generated XML formatted documents will not contain 
DTDs. Therefore, a tool for automatically generating DTDs is desirable for improving data 
storage and retrieval. 

Others have attempted to automatically generate DTDs with varying degrees of 
success. One system is IBM's Data Descriptors by Example (DDbE) system. The goal of 
DDbE is to give users a good start at creating DTDs for their own applications. However, 
this system and other available systems do not produce highly accurate DTDs for all XML 
documents, especially complex XML documents. Since accurate DTDs enable efficient 
storage and retrieval of data, improved methods for extracting accurate DTDs from XML 
documents are desirable. 

SUMMARY OF THE INVENTION 

The present invention relates to developing a description of the layout of an electronic 
document from data within the document. The present invention is especially useful for 
determining document type descriptors (DTDs) of electronic documents in a Standard 
Generalized Markup Language (SGML) format. 

The present invention comprises generalizing input sequences generated from an 
electronic document. The input sequences are generalized to create generalized sequences 
which are representative of the input sequences. Each generalized sequence encompasses one 
or more input sequences in a more general form. Next, the present invention comprises 
selecting a description of the layout of the electronic document from the input sequences and 
generalized sequences. Selecting a description comprises selecting one or more of the input 
sequences and generalized sequences such that every input sequence is encompassed by the 
selected sequences. Preferably, the selection is performed using minimum descriptor length 
(MDL) principles. 

Additionally, the present invention may comprise factoring the input sequences and 
generalized sequences after generalizing to create factored sequences which can be included 
in the selection of the description. Each factored sequence encompasses one or more input 
sequences and generalized sequences. The factored sequences are combined with the input 
sequences and generalized sequences, thereby creating a potentially better selection of 
sequences from which a description may be selected. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is a flow chart of a preferred document type descriptor (DTD) extraction 
system in accordance with the present invention; and 

Figure 2 is an illustrative depiction of the output of each step and the selection process 
of the preferred document type descriptor extraction system depicted in Figure 1 in 
accordance with the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention relates to inferring (i.e., determining) document descriptors 
from data within electronic documents. For illustrative purposes, the present invention is 
described in terms of inferring document type descriptors (DTDs) from data within 
Extensible Markup Language (XML) formatted documents. However, it will be readily 
apparent to those skilled in the art that the present invention could be applied to other types of 
markup languages which provide document descriptions, whether currently available or 
developed in the future, such as markup languages which conform to the Standard Generalized 
Markup Language (SGML) format. The inferred DTD contains valuable information about 
the structure of the XML documents that it describes. The structural information may be used 
to efficiently query the XML document, store data to the XML document, or generate similar 
XML documents. 

A sample XML document and its associated DTD are as follows: 

Sample XML Document 

<article> 
    <title> A Relational Model </title> 
    <author> 
        <name> B. Cod </name> 
        <affiliation> ABC Corp. </affiliation> 
    </author> 
</article> 
<article> 
    <title> Another Model </title> 
    <author> 
        <name> A. Swift </name> 
        <affiliation> XYZ Corp. </affiliation> 
    </author> 
    <author> 
        <name> B. Quick </name> 
        <affiliation> XYZ Corp. </affiliation> 
    </author> 
    <author> 
        <name> Q. Gold </name> 
        <affiliation> XYZ Corp. </affiliation> 
    </author> 
    <author> 
        <name> P. Henry </name> 
        <affiliation> XYZ Corp. </affiliation> 
    </author> 
    <author> 
        <name> E. Plant </name> 
        <affiliation> XYZ Corp. </affiliation> 
    </author> 
</article> 

Sample Document Type Descriptor (DTD) 

<!ELEMENT article (title, author*)> 
<!ELEMENT title (#PCDATA)> 
<!ELEMENT author (name, affiliation)> 
<!ELEMENT name (#PCDATA)> 
<!ELEMENT affiliation (#PCDATA)> 



A DTD describes the structure of an XML document. A DTD constrains the structure 
of an element by specifying a regular expression to which its sub-element sequences must 
conform. The DTD declaration syntax uses commas for sequencing, | for exclusive OR, 
parentheses for grouping, and the meta-characters ?, *, + to denote zero or one, zero or more, and 
one or more, respectively. 
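
The following is a minimal Python sketch, offered only as an illustration, of how such a declaration can be read as a regular expression over sub-element names. The helper names content_model_to_regex and conforms are hypothetical, each sub-element name is encoded as the name followed by ";" so that names cannot run together, and #PCDATA and attribute declarations are not handled. 

```python
import re

# Hypothetical helper names; element names are assumed to be simple XML names.
NAME = re.compile(r"[A-Za-z_][\w.-]*")


def content_model_to_regex(model):
    """Translate a DTD content model such as '(title, author*)' into a Python regex.

    Each element name becomes the group '(?:name;)'. Commas (sequencing) and
    whitespace are then dropped, because concatenation is implicit in a regex.
    The operators |, ?, *, + and parentheses carry over unchanged.
    """
    tokenized = NAME.sub(lambda m: "(?:" + m.group(0) + ";)", model)
    return re.compile(re.sub(r"[\s,]+", "", tokenized))


def conforms(model, child_tags):
    """Check whether a sequence of sub-element names matches the content model."""
    encoded = "".join(tag + ";" for tag in child_tags)
    return content_model_to_regex(model).fullmatch(encoded) is not None


print(conforms("(title, author*)", ["title", "author", "author"]))  # True
print(conforms("(title, author*)", ["author", "title"]))            # False
print(conforms("(name, affiliation)", ["name"]))                    # False
```

Under these assumptions, the declaration for article in the sample DTD accepts a title followed by any number of authors and rejects any other ordering. 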

In the sample XML document above and its associated DTD, the start of an element 
such as article is indicated by <article> and the end of the element is indicated by </article>. 
Each element may comprise sub-elements and/or data. For example, for the element article, 
title and author are sub-elements. Likewise, sub-elements may further contain additional sub- 
elements. For example, author contains sub-element name and sub-element affiliation. 

In a preferred embodiment, the present invention applies algorithms in three steps to 
compute a DTD from a set of input sequences. They are (1) generalizing, (2) factoring, and 
(3) selecting. 

The input sequences are groupings of sub-elements contained within each occurrence 
of an element. For an element, such as article, it is straightforward to compute the sequence 
of sub-elements nested within each <article> </article> pair in the XML document. The set 
of input sequences comprises one sequence for each occurrence of element <article>. For 
example, in the above XML document sample the input sequences for <article> would be 
<title><author> and <title><author><author><author><author><author>. 
For ease of description, the first letter of the sub-element may be used as a shorthand for 
describing sequences (e.g., <title><author> is represented by ta and 
<title><author><author><author><author><author> is represented by taaaaa). 
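
As an illustration only, the computation of input sequences can be sketched in Python with the standard xml.etree.ElementTree module. The function names input_sequences and shorthand, the wrapping <articles> root element, and the placeholder author contents are assumptions made for this sketch and are not part of the described system. 

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def input_sequences(xml_text):
    """Map each element name to its child-tag sequences, one per occurrence."""
    sequences = defaultdict(list)
    for element in ET.fromstring(xml_text).iter():
        sequences[element.tag].append([child.tag for child in element])
    return dict(sequences)


def shorthand(sequence):
    """First-letter shorthand, e.g. ['title', 'author'] becomes 'ta'."""
    return "".join(tag[0] for tag in sequence)


# Stripped-down stand-in for the sample document: one article with a single
# author and one article with five authors, wrapped in an assumed root element.
AUTHOR = "<author><name>N</name><affiliation>A</affiliation></author>"
SAMPLE = (
    "<articles>"
    "<article><title>A Relational Model</title>" + AUTHOR + "</article>"
    "<article><title>Another Model</title>" + AUTHOR * 5 + "</article>"
    "</articles>"
)

seqs = input_sequences(SAMPLE)
article_shorthand = [shorthand(s) for s in seqs["article"]]
print(article_shorthand)  # ['ta', 'taaaaa']
```

Each occurrence of author in this stand-in yields the sequence name, affiliation, so the same function supplies the input sequences for every element of the document in one pass. 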

In the preferred embodiment, the input sequences are generalized to create generalized 
sequences. The generalized sequences and input sequences are then factored to create 
factored sequences. Each factored sequence may encompass one or more input sequences and 
generalized sequences, thereby creating additional sequences which may be selected as a part 
of a DTD. The factoring step is optional. However, using the factoring step results in 
potentially better DTDs. Factoring leads to better DTDs by creating additional sequences 
from which an appropriate DTD may be selected. A DTD which encompasses all of the input 
sequences is then selected from the input sequences, generalized sequences, and factored 
sequences. 

In the generalization step, patterns within the input sequences are detected and more 
"general" regular expressions are substituted for them to create "generalized" sequences. In a 
preferred embodiment, the "generalized" sequences and the input sequences are then 
processed by the factorization step which factors common expressions to make them more 
succinct. The factorization step yields "factored" sequences. The first two steps along with 
the input sequences produce a series of potential DTDs that vary in their conciseness and 
precision. A selection step then selects a DTD from the candidates that strikes the right 
balance between conciseness and preciseness, that is, a DTD that is concise, but at the same 
time, is not too general. In a preferred embodiment, the selection step employs minimum 
descriptor length (MDL) principles for selecting a DTD. 

Figure 1 depicts a flow chart 100 illustrating the steps for inferring a DTD in 
accordance with a preferred embodiment of the present invention. The input sequences I are 
composed of sub-elements a, b, c, d, and e. The input sequences are first processed by a 
generalization module 110 which produces generalized sequences. The generalized 
sequences are combined with the input sequences to create a set of potential DTDs identified 
by S_G. Optionally, the potential DTDs are factored using a factoring module 120. The 
factoring module produces additional potential DTDs which are combined with the potential 
DTDs output by the generalization module 110 to create a set of potential DTDs identified by 
S_F. Finally, the selecting module 130 infers (i.e., selects) a DTD from all of the potential 
DTDs S_F. Preferably, the selecting module 130 incorporates MDL principles. 

Figure 2 graphically depicts the selection of a DTD from all of the potential DTDs. 
The selected DTD must encompass all of the original input sequences I. It can be seen that 
(ab)* encompasses input sequences ab and abab. Also, (a|b)(c|d) encompasses input 
sequences ac, ad, bc, and bd. Finally, b*(d|e) encompasses bd, bbd, and bbbbe. The selected 
potential DTDs, when combined using ORs, encompass all of the original input sequences. 
The result is a concise and precise DTD. 
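
The coverage relationship described for Figure 2 can be checked with a short Python sketch. It assumes single-letter sub-element names so that each input sequence is a plain string, and it uses Python's re.fullmatch as a stand-in for "encompasses"; the helper name covered_by is hypothetical. 

```python
import re

# Input sequences and selected candidates as described for Figure 2.
inputs = ["ab", "abab", "ac", "ad", "bc", "bd", "bbd", "bbbbe"]
selected = ["(ab)*", "(a|b)(c|d)", "b*(d|e)"]


def covered_by(candidate, sequences):
    """Return the sequences that the candidate regular expression matches in full."""
    pattern = re.compile(candidate)
    return [s for s in sequences if pattern.fullmatch(s)]


coverage = {candidate: covered_by(candidate, inputs) for candidate in selected}
uncovered = [s for s in inputs if not any(s in hits for hits in coverage.values())]

for candidate, hits in coverage.items():
    print(candidate, "covers", hits)
print("uncovered:", uncovered)  # uncovered: []
```

An empty list of uncovered sequences is the condition the selection step must satisfy; which covering set is preferred is then decided by the selection criterion, not by this check. 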



I. GENERALIZING 

The quality of the document type descriptor (DTD) produced by the selection process is 
very dependent on the set of candidate DTDs available. If the selection were based on the 
input sequences only, then the final DTD output by the selection step would simply be the OR 
of all the input sequences. For example, in the above XML document sample, the DTD for 
<article> would comprise ta and taaaaa (i.e., <title><author> and 
<title><author><author><author><author><author>). However, this is not a desirable DTD 
since it is neither concise nor intuitive. A more concise and intuitive DTD would be the 
single sequence ta*, which encompasses both ta and taaaaa. Thus, in order to infer 
meaningful DTDs, the candidate DTDs should be general. Ideally, each candidate DTD 
encompasses more than one input sequence. The goal of the generalization module 110 is to 
achieve this objective. 

The generalization module 110 of the present invention infers a number of regular 
expressions which have been found to frequently appear in real-life DTDs. Below are 
examples of regular expressions from real-life DTDs that appear in the Newspaper 
Association of America (NAA) Classified Advertising Standards XML DTD: 

a*bc* : DTDs of this form are generally used to specify tuples with set-valued attributes. 
<!ELEMENT account-info (account-number, sub-account-number*)> <!-- 
Specification for account identification information --> 

(abc)* : This type of DTD is used to represent a set (or a list) of ordered tuples. 

<!ELEMENT days-and-hours (date, time)+> <!-- provide times/dates when job fairs 
will be held --> 

(a|b|c)* : The DTD of the form (a|b|c)* is frequently used to represent a multiset containing 
the elements a, b and c. This DTD is very useful since the elements in the multiset are 
allowed to appear multiple times and in any order in the document. For example, the 
following DTD specifies that the support information for an ad can consist of an 
arbitrary number of audio or video clips and photos, and further that these can appear in any 
order. 

<!ELEMENT support-info (audio-clip | file-id | graphic | logo | new-list | photo | 
video-clip | zz-generic-tag)*> <!-- support information for ad content --> 

((ab)*c)* : This type of DTD permits nesting relationships among sets (or lists). 

<!ELEMENT transfer-info (transfer-number, (from-to, company-id)+, contact-info)*> 
<!-- provides parent information through the multilevel aggregation process, may be 
repeated --> 

Table 1 depicts pseudo code for a preferred generalization algorithm (Procedure 
GENERALIZE). Procedure GENERALIZE infers several DTDs for each input sequence 
independently and adds them to the set S_G. The generalize algorithm may over-generalize in 
some cases (since DTDs are inferred based on a single sequence). However, the selection step 
in selecting module 130 will ensure that such overly-general DTDs are not chosen as part of 
the final inferred DTD if there are better alternatives. The generalization step will provide 
several alternate candidates in addition to the input sequences for the selection step. 

The algorithm can infer regular expressions that are more complex than the above. 
However, complex expressions, such as (ab?c*d?)*, that are less likely to occur in practice 
may be excluded. 

procedure GENERALIZE(I) 
begin 
1. for each sequence s in I 
2.     add s to S_G 
3.     for r := 2, 3, 4 
4.         s' := DISCOVERSEQPATTERN(s, r) 
5.         for d := 0.1·|s'|, 0.5·|s'|, |s'| 
6.             s'' := DISCOVERORPATTERN(s', d) 
7.             add s'' to S_G 
end 

procedure DISCOVERSEQPATTERN(s, r) 
begin 
1. repeat 
2.     let x be a subsequence of s with the maximum number (>= r) of contiguous repetitions in s 
3.     replace all (>= r) contiguous occurrences of x in s with a new auxiliary symbol A_i = (x)* 
4. until (s no longer contains >= r contiguous occurrences of any subsequence x) 
5. return s 
end 

procedure DISCOVERORPATTERN(s, d) 
begin 
1. s_1, s_2, ..., s_n := PARTITION(s, d) 
2. for each subsequence s_j in s_1, s_2, ..., s_n 
3.     let the set of distinct symbols in s_j be a_1, a_2, ..., a_m 
4.     if (m > 1) 
5.         replace subsequence s_j in sequence s by a new auxiliary symbol A_j = (a_1|a_2|...|a_m)* 
6. return s 
end 

procedure PARTITION(s, d) 
begin 
1. i := start := end := 1 
2. s_i := s[start, end] 
3. while (end < |s|) 
4.     while (end < |s| and a symbol in s_i occurs to the right of s_i within a distance d) 
5.         end := end + 1; s_i := s[start, end] 
6.     if (end < |s|) 
7.         i := i + 1; start := end + 1; end := end + 1; s_i := s[start, end] 
8. return s_1, s_2, ..., s_i 
end 

Table 1: Generalization Algorithm 
The essence of procedure GENERALIZE is the pair of procedures 
DISCOVERSEQPATTERN and DISCOVERORPATTERN, which are repeatedly called with 
predefined parameter values. 



Discovering Sequencing Patterns (Procedure DISCOVERSEQPATTERN) 

Procedure DISCOVERSEQPATTERN, shown in Table 1, takes an input sequence s 
and returns a candidate DTD that is derived from s by replacing sequencing patterns of the 
form xx...x, for a subsequence x in s, with the regular expression (x)*. In addition to s, the 
procedure also accepts as input a threshold parameter r > 1, which is the minimum number of 
contiguous repetitions of subsequence x in s required for the repetitions to be replaced with 
(x)*. In case there are multiple subsequences x with the maximum number of repetitions in 
step 2 of procedure DISCOVERSEQPATTERN, the longest among them is chosen, and 
subsequent ties are resolved arbitrarily. 

Note that instead of introducing the regular expression term (x)* into the sequence s, 
an auxiliary symbol that serves as a representative for the term is introduced. The use of 
auxiliary symbols enables the description of the algorithms to remain simple and clean, since 
the input to them is always a sequence of symbols. In a preferred embodiment, there is a 
one-to-one correspondence between auxiliary symbols and regular expression terms in the 
present invention; thus, if the auxiliary symbol A denotes (bc)* in one candidate DTD, then 
it represents (bc)* in every other candidate DTD. Also, procedure 
DISCOVERSEQPATTERN may perform several iterations, and thus new sequencing patterns 
may contain auxiliary symbols corresponding to patterns replaced in previous iterations. For 
example, invoking procedure DISCOVERSEQPATTERN with the input sequence s = 
abababcababc and r = 2 yields the sequence A_1cA_1c after the first iteration, where A_1 is an 
auxiliary symbol for the term (ab)*. After the second iteration, the procedure returns the 
candidate DTD A_2, where A_2 is the auxiliary symbol corresponding to ((ab)*c)*. Thus, the 
resulting candidate DTD returned by procedure DISCOVERSEQPATTERN can contain *s 
nested within other *s. Finally, DISCOVERSEQPATTERN is invoked with three different 
values for the parameter r to control the aggressiveness of the generalization. For example, 
for the sequence aabbb, DISCOVERSEQPATTERN with r = 2 would infer a*b*, while with 
r = 3, it would infer aab*. In the selection step, if many other sequences are encompassed by 
aab*, then a DTD of aab* may be preferred to a DTD of a*b* since it more accurately 
describes the input sequences. 
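
The following Python sketch illustrates the sequencing-pattern procedure under stated assumptions; it is not presented as the implementation of the invention. Symbols are single characters, and each auxiliary symbol is represented directly by the text of its regular expression term, which keeps the one-to-one correspondence between symbols and terms. The function names are hypothetical, and ties beyond "most repetitions, then longest" are resolved by taking the first candidate found. 

```python
def best_repetition(seq, r):
    """Return the subsequence x with the most (>= r) contiguous repetitions, or None.

    Ties on the repetition count go to the longer x; remaining ties keep the
    first candidate found.
    """
    best_key, best_x = None, None
    n = len(seq)
    for length in range(1, n // r + 1):
        for start in range(n - length + 1):
            x = seq[start:start + length]
            count, pos = 1, start + length
            while seq[pos:pos + length] == x:
                count, pos = count + 1, pos + length
            if count >= r and (best_key is None or (count, length) > best_key):
                best_key, best_x = (count, length), x
    return best_x


def star(x):
    """Term for repeated x: 'a*' for one plain symbol, '(...)*' otherwise."""
    body = "".join(x)
    return body + "*" if len(body) == 1 else "(" + body + ")*"


def replace_runs(seq, x, r):
    """Replace every run of >= r contiguous copies of x with its starred term."""
    out, i, k = [], 0, len(x)
    while i < len(seq):
        count = 0
        while seq[i + count * k:i + (count + 1) * k] == x:
            count += 1
        if count >= r:
            out.append(star(x))
            i += count * k
        else:
            out.append(seq[i])
            i += 1
    return out


def discover_seq_pattern(s, r):
    """Sketch of DISCOVERSEQPATTERN for a string s of one-character symbols, r >= 2."""
    seq = list(s)
    while True:
        x = best_repetition(seq, r)
        if x is None:
            return "".join(seq)
        seq = replace_runs(seq, x, r)


print(discover_seq_pattern("abababcababc", 2))  # ((ab)*c)*
print(discover_seq_pattern("aabbb", 2))         # a*b*
print(discover_seq_pattern("aabbb", 3))         # aab*
```

Each replacement shortens the working sequence because at least r >= 2 symbols collapse into one, so the loop terminates. The nested result for abababcababc arises exactly as described above: the first pass yields (ab)* c (ab)* c, and the second pass finds two repetitions of the pair. 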

Discovering OR Patterns (Procedure DISCOVERORPATTERN) 

Procedure DISCOVERORPATTERN, shown in Table 1, infers patterns of the form
(a1|a2|...|am)* based on the locality of these symbols within a sequence s. The locality is
identified by first partitioning (performed by procedure PARTITION, shown in Table 1) the
input sequence s into the smallest possible subsequences s_1, s_2, ..., s_n, such that for any
occurrence of a symbol a in a subsequence s_i, there does not exist another occurrence of a in
some other subsequence s_j within a distance d (which is a parameter to
DISCOVERORPATTERN). Each subsequence s_i in s is then replaced by the pattern
(a1|a2|...|am)*, where a1, ..., am are the distinct symbols in the subsequence s_i. If s_i contains
frequent repetitions of the symbols a1, a2, ..., am in close proximity, then it is very likely that
s_i originated from a regular expression of the form (a1|a2|...|am)*. For illustrative purposes,
for the input sequence abcbac, procedure DISCOVERORPATTERN returns:

• aA1ac for d = 2, where A1 = (b | c)* ;

• aA2 for d = 3, where A2 = (a | b | c)* ; and

• A2 for d = 4, where A2 = (a | b | c)* .

A preferred component for discovering OR patterns is procedure PARTITION, shown
in Table 1. For a sequence s, s[i, j] denotes the subsequence of s starting at the i-th symbol and
ending at the j-th symbol of s. Procedure PARTITION constructs the subsequences in the
order s_1, s_2, and so on. Assuming that s_1 through s_j have been generated, it constructs s_{j+1} by
starting s_{j+1} immediately after s_j ends and expanding the subsequence s_{j+1} to the right as long
as there is a symbol in s_{j+1} that occurs within a distance d to the right of s_{j+1}. By construction,
there cannot exist such a symbol to the left of s_{j+1}.
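The partitioning idea can be sketched in Python as follows. This is an illustration rather than the procedure of Table 1: it assumes that the distance between two occurrences is the difference of their positions, that occurrences at distance at most d must fall in the same subsequence, and that single-symbol subsequences are left unchanged (as in the abcbac example). The function names are hypothetical.

```python
def partition(seq, d):
    """Split seq into the smallest contiguous pieces such that two occurrences
    of the same symbol at distance <= d always land in the same piece."""
    parts, start, n = [], 0, len(seq)
    while start < n:
        end = start
        p = start
        while p <= end:
            # Look for the farthest repeat of seq[p] within distance d to the right.
            for q in range(min(p + d, n - 1), p, -1):
                if seq[q] == seq[p]:
                    end = max(end, q)
                    break
            p += 1
        parts.append(seq[start:end + 1])
        start = end + 1
    return parts


def discover_or_pattern(seq, d):
    """Replace every multi-symbol piece by (a1|a2|...|am)* over its distinct symbols."""
    out = []
    for part in partition(seq, d):
        if len(part) == 1:
            out.append(part)
        else:
            out.append("(" + "|".join(sorted(set(part))) + ")*")
    return "".join(out)


for d in (2, 3, 4):
    print(d, partition("abcbac", d), discover_or_pattern("abcbac", d))
```

With these assumptions the three values of d give a(b|c)*ac, a(a|b|c)* and (a|b|c)*, matching the bullet list above once the auxiliary symbols are expanded.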

Note that procedure GENERALIZE invokes DISCOVERORPATTERN on the DTDs 
that result from calls to DISCOVERSEQPATTERN and therefore it is possible to infer more 
complex DTDs of the form (a|(bc)*)* in addition to DTDs like (a|b|c)*. For instance, for the
input sequence s = abcbca, procedure DISCOVERSEQPATTERN invoked with r = 2 would
return s' = aA1a, where A1 = (bc)*, which, when input to DISCOVERORPATTERN, returns
s" = A2 for d = |s'|, where A2 = (a|A1)*. Further, DISCOVERORPATTERN is invoked with
various values of d (expressed as a fraction of the length of the input sequence) to control the 
degree of generalization. Small values of d lead to conservative generalizations while larger 
values result in more liberal generalizations. The size of d is based on desired design 
characteristics. 

II. FACTORING 

In a preferred embodiment, the factoring module 120 uses a factoring step to derive
factored forms for expressions that are an OR of a subset of the candidate DTDs, S_G, out of
the generalization module 110. For example, for candidate DTDs ac, ad, bc and bd in S_G, the
factoring step would generate the factored form (a | b)(c | d). Note that since the final DTD is
an OR of candidate DTDs, S_F, out of the factoring module 120, the factored forms are also
candidates. Further, a factored candidate DTD, because of its smaller size, has a lower
minimum description length (MDL) cost, and is thus more likely to be chosen in the selection
step, if MDL principles are used. Thus, since factored forms (due to their compactness) are
more desirable, factoring can result in better quality DTDs.

Factored DTDs are common in real life. For example, in the sample DTD, an article
may be categorized based on whether it appeared in a workshop, conference or journal; it may
also be classified according to its area as belonging to either computer science, physics,
chemistry, etc. Thus, the DTD (in factored form) for the element article would be as follows:

<!ELEMENT article (title, author*, (workshop | conference | journal),
(computer science | physics | chemistry | ...))>

The set of candidate DTDs, S_F, output by the factoring module 120, in addition to
the factored forms generated from candidates in S_G, also contains all the DTDs in S_G. Ideally,
factored forms for every subset of S_G should be added to S_F to be considered by the selection
step. However, this may be impractical, since S_G could be very large. Therefore, a heuristic
may be used to select subsets of candidates in S_G that, when factored, yield good factored
DTDs. In a preferred embodiment, the factoring algorithm is an adaptation of factoring
algorithms for boolean expressions which are well known in the art.

Selecting Subsets of S_G to Factor

Intuitively, a subset S of S_G out of generalization module 110 is a good candidate for
factoring if the factored form of S is much smaller than S itself. In addition, even though S_G
may contain multiple generalizations that are derived from the same input sequence, it is
highly unlikely that the final DTD will contain two generalizations of the same input
sequence. Thus, factoring candidate DTDs in S_G that encompass similar sets of input
sequences does not lead to factors that can improve the quality of the final DTD.

For a subset S of S_G to yield good factored forms, it must satisfy the following two
properties:

(1.) Every DTD in S has a common prefix or suffix with a number of
other DTDs in S. Further, as more DTDs in S share common prefixes or suffixes,
or as the length of the common prefixes/suffixes increases, the quality of the
generated factored form can be expected to improve.

(2.) The overlap between every pair of DTDs D, D' in S is minimal, that
is, the intersection of the input sequences encompassed by D and D' is small. This
is important because, as mentioned above, a factored DTD adds little value (from
an MDL cost perspective) over the candidate DTDs from which it was derived if
it cannot be used to encode a significantly larger number of input sequences
compared to the sequences encompassed by each individual DTD.

In order to state properties (1) and (2) for a set S of DTDs more formally, the
following notation is used. For a DTD D, let cover(D) denote the input sequences in I that are
encompassed by D (note that auxiliary symbols are expanded completely when cover for a
DTD is computed). Then, overlap(D, D') is defined as the fraction of the input sequences
encompassed by D and D' that are common to D and D', that is,

overlap(D, D') = |cover(D) ∩ cover(D')| / |cover(D) ∪ cover(D')|     (1)

Thus, for a sufficiently small value of a (user-specified) parameter δ, by ensuring that
overlap(D, D') < δ for every pair of DTDs D and D' in S, it can be ensured that S satisfies
property (2) mentioned above.

In order to characterize property (1) more rigorously, the function score(D,S) is 
introduced in equation 2. Function score (D, S) attempts to capture the degree of similarity 
between prefixes/suffixes of DTD D and those of DTDs in the set S of DTDs. Intuitively, a 
DTD with a high score with respect to set S is a good candidate to be factored with other 
DTDs in set S. For a DTD D, let pref (D) and suf(D) denote the set of prefixes and suffixes 
of D, respectively. Let psup(p,S) denote the support of prefix p in set S of DTDs, that is, the 
number of DTDs in S for which p is a prefix. Similarly, let ssup(s,S) denote number of DTDs 
in S for which s is a suffix. Then score(D,S) is defined as follows: 

score(D,S) = max({|p| * psup(p,S) : p ∈ pref(D)} ∪ {|s| * ssup(s,S) : s ∈ suf(D)})     (2)

Thus, the prefix p or suffix s of D for which the product of its length and its support
in S is maximum determines the score of D with respect to S. If DTD D has a long prefix or
suffix that occurs frequently in set S, then this prefix or suffix can be factored out, thus
resulting in good factored forms. The function score is thus a good measure of how well D
would factor with other DTDs in S.
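As a small illustration of the two measures, the following Python sketch computes overlap from cover sets and score from prefix and suffix supports. It assumes that each DTD is represented as a plain string of single-character symbols and that cover sets are given as Python sets of input-sequence identifiers; both representations are simplifications introduced here for the example.

```python
def overlap(cover_d, cover_e):
    """Fraction of the sequences covered by either DTD that both cover."""
    union = cover_d | cover_e
    return len(cover_d & cover_e) / len(union) if union else 0.0


def score(dtd, dtd_set):
    """max over prefixes p and suffixes s of dtd of length * support in dtd_set."""
    best = 0
    for n in range(1, len(dtd) + 1):
        psup = sum(other.startswith(dtd[:n]) for other in dtd_set)
        ssup = sum(other.endswith(dtd[-n:]) for other in dtd_set)
        best = max(best, n * psup, n * ssup)
    return best


print(overlap({1, 2, 3}, {2, 3, 4}))
print(score("abx", ["abx", "aby", "abz"]))
print(score("ac", ["ac", "ad", "bc", "bd"]))
```

In the second call the prefix ab has length 2 and support 3, which gives the score 6; in the third, every prefix or suffix of ac is shared by at most two candidates, so the score is 2.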

Procedure FACTORSUBSETS, shown in Table 2, first selects subsets S of the candidate
DTDs in S_G that satisfy properties (1) and (2). Each of these subsets S is then
factored by invoking procedure FACTOR (in Step 15), depicted in Table 3. Assuming that the
factoring algorithm returns F1 | F2 | ... | Fm, each of the Fi is added to S_F.



procedure FACTORSUBSETS(S_G)
begin
1.  for each DTD D in S_G
2.      Compute score(D, S_G)
3.  S_F := S' := S_G; SeedSet := ∅
4.  for i := 1 to k
5.      let D be the DTD in S' with the maximum value for score(D, S_G)
6.      SeedSet := SeedSet ∪ D
7.      S' := S' − {D' : overlap(D, D') ≥ δ}
8.  for each DTD D in SeedSet
9.      S := {D}
10.     S' := S_G − {D' : overlap(D, D') ≥ δ}
11.     while (S' is not empty)
12.         let D' be the DTD in S' with the maximum value for score(D', S)
13.         S := S ∪ D'
14.         S' := S' − {D'' : overlap(D', D'') ≥ δ}
15.     F := FACTOR(S)
16.     S_F := S_F ∪ {F1, ..., Fm}   /* F = F1 | ... | Fm */
end

Table 2: Choosing Subsets of S_G for Factoring

Procedure FACTORSUBSETS computes a set S of candidate DTDs to factor. First, k
seed DTDs for the sets S to be factored are chosen in the for loop spanning steps 4-7. These
seed DTDs have a high score value with respect to S_G and overlap minimally with each other.
Thus, it is ensured that each seed DTD not only factors well with other DTDs in S_G, but is
also significantly different from other seeds. In steps 9-14, each seed DTD is used to construct
a new set S of DTDs to be factored (thus, only k sets of DTDs are generated). After
initializing S to a seed DTD D, in each subsequent iteration, the next DTD D' that is added to
S is chosen greedily (i.e., the one whose score with respect to DTDs in S is maximum and
whose overlap with DTDs already in S is less than δ).
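The seed selection and greedy growth described above can be sketched in Python as follows. The sketch stops before the call to FACTOR and simply returns the chosen subsets. It assumes DTDs are strings of single-character symbols, that cover maps each DTD to a set of input-sequence identifiers, and that ties in score are broken by list order; the function and parameter names are hypothetical.

```python
def overlap(a, b, cover):
    union = cover[a] | cover[b]
    return len(cover[a] & cover[b]) / len(union) if union else 0.0


def score(d, dtds):
    best = 0
    for n in range(1, len(d) + 1):
        best = max(best,
                   n * sum(o.startswith(d[:n]) for o in dtds),
                   n * sum(o.endswith(d[-n:]) for o in dtds))
    return best


def choose_subsets(s_g, cover, k, delta):
    """Return up to k subsets of s_g to factor (steps 4-14 of the procedure above)."""
    # Steps 4-7: pick seeds with high score that overlap little with each other.
    seeds, pool = [], list(s_g)
    for _ in range(k):
        if not pool:
            break
        d = max(pool, key=lambda x: score(x, s_g))
        seeds.append(d)
        pool = [x for x in pool if x != d and overlap(d, x, cover) < delta]
    # Steps 8-14: grow one subset per seed, greedily by score against the subset.
    subsets = []
    for seed in seeds:
        s = [seed]
        pool = [x for x in s_g if x != seed and overlap(seed, x, cover) < delta]
        while pool:
            nxt = max(pool, key=lambda x: score(x, s))
            s.append(nxt)
            pool = [x for x in pool if x != nxt and overlap(nxt, x, cover) < delta]
        subsets.append(s)
    return subsets


s_g = ["ac", "ad", "bc", "bd", "ae"]
cover = {"ac": {0}, "ad": {1}, "bc": {2}, "bd": {3}, "ae": {0}}
print(choose_subsets(s_g, cover, k=1, delta=0.5))
```

In this toy input the candidate ae covers the same input sequence as the seed ac, so it is excluded from the subset, while ad, bc and bd are added one at a time.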

Algorithm for Factoring a Set of DTDs

Algorithms for computing the optimum factored form, that is, the one with the
minimum number of literals, are known in the art. However, the complexity of these known
techniques may be impractical. In a preferred embodiment, heuristic factoring algorithms for
boolean functions which are known in the art are adapted for use in the present invention.
Factored forms of boolean functions are commonly used in VLSI design.

There is a close correspondence between the semantics of DTDs and those of boolean 
expressions. The sequencing operator (,) in DTDs is similar to a logical AND in boolean 
algebra, while the OR operator (|) is like a logical OR. However, there exist certain 
fundamental differences between DTDs and boolean expressions. First, while the logical 
AND operator in boolean logic is commutative, the sequencing operator in DTDs is not (the 
ordering of symbols in a sequence matters!). Second, in boolean logic, the expression a | ab
is equivalent to a; however, the equivalent DTD for a | ab is ab?. The boolean algorithms can
be modified to create a factoring algorithm to handle the semantics of the DTDs. The
pseudo-code for procedure FACTOR is shown in Table 3. Procedure FACTOR is a
preferred embodiment of the factoring algorithm used in factoring module 120.



procedure FACTOR(S) /* S is the set of sequences to be factored */
begin
1.  DivisorSet := FINDALLDIVISORS(S)
2.  if (DivisorSet = ∅)
3.      return OR of sequences in S
4.  DivisorList := ∅
5.  for each divisor V in DivisorSet
6.      Q, R := DIVIDE(S, V)
7.      add (V, Q, R) to DivisorList
8.  find the most compact triplet (Vi, Qi, Ri) in DivisorList
9.  return (FACTOR(Vi))(FACTOR(Qi)) | FACTOR(Ri)
end

procedure FINDALLDIVISORS(S)
begin
1.  DivisorSet := ∅
2.  for each distinct sequence s such that s is a suffix for at least two elements in S
3.      DivisorSet := DivisorSet ∪ {{p : ps ∈ S}}
4.  return DivisorSet
end

procedure DIVIDE(S, V)
begin
1.  for each sequence p in V
2.      q_p := {s | ps ∈ S}
3.  Q := ∩_{p ∈ V} q_p
4.  R := S − V∘Q
    /* V∘Q is the set of sequences resulting from concatenating
       every sequence in Q to the end of every sequence in V */
5.  return Q, R
end

Table 3: Factoring Algorithm

As an example of the factoring algorithm, consider the set S = {b, c, ab, ac, df, dg, ef,
eg} of input sequences corresponding to the expression b|c|ab|ac|df|dg|ef|eg whose factored
form is a?(b|c)|(d|e)(f|g). Before the steps that procedure FACTOR performs to derive the
factored form are discussed, the DIVIDE operation that constitutes the core of the factoring
algorithm is introduced. For sets of sequences S, V, DIVIDE(S,V) returns a quotient Q and
remainder R such that S = V∘Q ∪ R (here, V∘Q is the set of sequences resulting from
concatenating every sequence in Q to the end of every sequence in V). Thus, for the above
set S and V = {d,e}, DIVIDE(S,V) returns the quotient Q = {f,g} and remainder R =
{b,c,ab,ac}. The steps executed by FACTOR to generate the factored form are as follows:

(1.) Compute set of potential divisors for S. These are simply sets of
prefixes that have a common suffix in S. Thus, potential divisors for S include {d,
e} (both f and g are common suffixes) and {1,a} (both b and c are common
suffixes). The symbol "1" is special and denotes the identity symbol with respect
to the sequencing operator, that is, 1s = s1 = s for every sequence s.

(2.) Choose divisor V from set of potential divisors. This is carried out by
first dividing S by each potential divisor V to obtain a quotient Q and remainder
R, and then selecting the V for which the triplet (V,Q,R) has the smallest size. In
our case, V = {d,e} results in a smaller quotient and remainder (Q = {f,g}, R =
{b,c,ab,ac}) than {1,a} (Q = {b,c}, R = {df,dg,ef,eg}) and is thus chosen.

(3.) Recursively factor V, Q, and R. The final factored form is
FACTOR(V)FACTOR(Q)|FACTOR(R), where V = {d,e}, Q = {f,g} and R =
{b,c,ab,ac}. Here, V and Q cannot be factored further since they have no
divisors. Thus, FACTOR(V) = (d | e) and FACTOR(Q) = (f | g). However, R can
be factored more since {1,a} is a divisor. Thus, repeating the above steps on R,
we obtain FACTOR(R) = (1|a)(b|c). Thus, the final factored form is
(1|a)(b|c)|(d|e)(f|g).

(4.) Simplify final expression by eliminating "1". The term (1|a) in the
final expression can be further simplified to a?. Thus, we obtain the desired
factored form for S.
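The four steps above can be traced with a short Python sketch of FACTOR, FINDALLDIVISORS and DIVIDE. It is a simplified illustration under stated assumptions: sequences are strings of single-character symbols, the identity symbol "1" is represented by the empty string, only non-empty suffixes are considered when collecting divisors, and the size of a triplet (V, Q, R) is taken to be the total number of symbols in the three sets. The helper names _or, _group and _size are hypothetical.

```python
def find_all_divisors(S):
    """Sets of prefixes that share a common non-empty suffix in S."""
    by_suffix = {}
    for seq in S:
        for i in range(len(seq)):
            by_suffix.setdefault(seq[i:], set()).add(seq[:i])
    return {frozenset(p) for p in by_suffix.values() if len(p) >= 2}


def divide(S, V):
    """Return quotient Q and remainder R with S = V∘Q ∪ R."""
    quotients = [{s[len(p):] for s in S if s.startswith(p)} for p in V]
    Q = set.intersection(*quotients)
    R = set(S) - {p + q for p in V for q in Q}
    return Q, R


def _size(*groups):
    return sum(len(seq) for group in groups for seq in group)


def _or(S):
    """Plain OR of the sequences in S; '' (the identity '1') becomes a trailing '?'."""
    items = sorted(x for x in S if x)
    if not items:
        return "", False
    body = "|".join(items)
    if "" in S:
        if len(items) > 1 or len(items[0]) > 1:
            body = "(" + body + ")"
        return body + "?", False
    return body, len(items) > 1


def _group(result):
    regex, is_alternation = result
    return "(" + regex + ")" if is_alternation else regex


def _factor(S):
    divisors = find_all_divisors(S)
    if not divisors:
        return _or(S)
    best = None
    for V in sorted(divisors, key=sorted):      # deterministic order
        Q, R = divide(S, V)
        if best is None or _size(V, Q, R) < _size(*best):
            best = (V, Q, R)
    V, Q, R = best
    product = _group(_factor(set(V))) + _group(_factor(Q))
    if not R:
        return product, False
    return product + "|" + _factor(R)[0], True


def factor(S):
    return _factor(set(S))[0]


print(factor({"b", "c", "ab", "ac", "df", "dg", "ef", "eg"}))
print(factor({"ac", "ad", "bc", "bd"}))
```

For the example set, the sketch picks V = {d,e} first (size 10 against 11 for {1,a}) and prints (d|e)(f|g)|a?(b|c), which is the factored form derived above with the two alternatives in the opposite order.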



III. SELECTING

The step of selecting comprises selecting a DTD. In a preferred embodiment, the 
DTD comprises one or more sequences from the input sequences, generalized sequences, and 
factored sequences. Alternatively, the DTD may be selected from the input sequences and 
generalized sequences if a factoring step is not used. In a preferred embodiment, the step of
selecting is implemented using minimum description length (MDL) principles.

The MDL cost of a DTD that is used to weigh a set of sequences is comprised of:

(A) the length, in bits, needed to describe the DTD, and

(B) the length of the sequences, in bits, when encoded in terms of the DTD.

First, the number of bits required to describe the DTD is estimated (part (A) of the
MDL cost). Let Σ be the set of subelement symbols that appear in sequences in I. Let M be
the set of metacharacters |, *, +, ?, (, ). Let the length of a DTD, viewed as a string in Σ ∪ M,
be n. Then, the length of the DTD in bits is n log(|Σ| + |M|). As an example, let Σ consist of
the elements a and b. The length in bits of the DTD a* b* is 4 * log(2 + 6) = 12. Similarly,
the length in bits of the DTD (ab|abb)(aa|ab*) is 16 * 3 = 48.
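The description-length computation can be checked with a few lines of Python. The sketch assumes base-2 logarithms, single-character subelement symbols, and that whitespace in the written DTD is not counted; the function name is hypothetical.

```python
import math

METACHARACTERS = set("|*+?()")


def dtd_description_bits(dtd, alphabet):
    """n * log2(|alphabet| + |M|), where n counts symbols and metacharacters."""
    n = sum(1 for ch in dtd if not ch.isspace())
    return n * math.log2(len(alphabet) + len(METACHARACTERS))


print(dtd_description_bits("a* b*", {"a", "b"}))
print(dtd_description_bits("(ab|abb)(aa|ab*)", {"a", "b"}))
```

With the two-symbol alphabet each character costs log2(8) = 3 bits, so the two calls reproduce the values 12 and 48 given in the text.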
The Encoding Scheme comprises the following steps:

(A) seq(D, s) = ε if D = s. In this case, DTD D is a sequence of symbols from the
alphabet Σ and does not contain any metacharacters.

(B) seq(D1...Dk, s1...sk) = seq(D1, s1)...seq(Dk, sk); that is, D is the concatenation of
regular expressions D1...Dk, and the sequence s can be written as the
concatenation of the subsequences s1...sk, such that each subsequence si
matches the corresponding regular expression Di.

(C) seq(D1 | ... | Dm, s) = i seq(Di, s); that is, D is the exclusive choice of regular
expressions D1...Dm, and i is the index of the regular expression that the
sequence s matches. Note that we need ⌈log m⌉ bits to encode the index i.

(D) seq(D*, s1...sk) = k seq(D, s1)...seq(D, sk).

In other words, the sequence s = s1...sk is produced from D* by instantiating
the repetition operator k times, and each subsequence si matches the i-th instantiation.
In this case, since there is no simple and inexpensive way to bound a priori the
number of bits required for the index k, we first specify the number of bits required to
encode k in unary (that is, a sequence of ⌈log k⌉ 1s, followed by a 0) and then the index
k using ⌈log k⌉ bits. The 0 in the middle serves as the delimiter between the unary
encoding of the length of the index and the actual index itself.

Table 4: Encoding Scheme

The MDL subsystem is responsible for choosing a set S of candidate DTDs from Sf 
such that the final DTD D (which is a logic OR of the DTDs in S) (1) encompasses all 
sequences in I, and (2) has the minimum MDL cost. 

Next, the scheme for encoding a sequence using a DTD (part (B) of the MDL cost) is 
determined. The encoding scheme constructs a sequence of integral indices (which forms the 
encoding) for expressing a sequence in terms of a DTD. The following simple examples 
illustrate the basic building blocks on which the encoding scheme for more complex DTDs is 
built: 

(1.) The encoding for the sequence a in terms of the DTD a is the empty string ε.
(2.) The encoding for the sequence b in terms of the DTD a | b | c is the integral index
1 (denotes that b is at position 1, counting from 0, in the above DTD).
(3.) The encoding for the sequence bbb in terms of the DTD b* is the integral index 3
(denotes 3 repetitions of b).

Next, the encoding scheme is generalized for arbitrary DTDs and arbitrary sequences.
The sequence of integral indices for a sequence s when encoded in terms of a
DTD D is denoted by seq(D,s). We define seq(D,s) recursively in terms of component DTDs
within D as shown in Table 4. Thus, seq(D,s) can be computed using a recursive procedure
based on the encoding scheme depicted in Table 4. Note that the definitions of
the encodings for operators + and ? have not been provided since these can be defined in a
similar fashion to * (for +, k is always greater than 0, while for ?, k can only assume values 1
or 0).

Next the encoding scheme is illustrated using the following example. Consider the
DTD (ab|c)* (de|fg*) and the sequence abccabfggg to be encoded in terms of the DTD.
Below, we list how steps (A), (B), (C) and (D) in Table 4 are recursively applied to derive the
encoding seq((ab|c)* (de|fg*), abccabfggg).

1. Apply Step (B). seq((ab|c)*, abccab) seq((de|fg*), fggg)

2. Apply Step (D). 4 seq(ab|c, ab) seq(ab|c, c) seq(ab|c, c) seq(ab|c, ab)
seq((de|fg*), fggg)

3. Apply Step (C). 4 0 seq(ab, ab) 1 seq(c, c) 1 seq(c, c) 0 seq(ab, ab) 1
seq(fg*, fggg)

4. Apply Step (A). 4 0 1 1 0 1 seq(fg*, fggg)

5. Apply Steps (A), (B) and (D). 4 0 1 1 0 1 3

In order to derive the final bit sequence corresponding to the above indices, the unary
representation for the number of bits required to encode the indices 4 and 3 is included in the
encoding. Thus, the following bit encoding for the sequence is obtained:

seq((ab|c)* (de|fg*), abccabfggg) = 1110100 0 1 1 0 1 11011

In steps (B), (C) and (D) of the encoding scheme, it needs to be determined if a
sequence s matches a DTD D. Since a DTD is a regular expression, known techniques for
finding out if a sequence is encompassed by a regular expression can be used. These known
methods involve constructing a non-deterministic finite automaton for D and can also be used
to decompose the sequence s into subsequences such that each subsequence matches the
corresponding sub-part of the DTD D, thus enabling the encoding to be determined.

Note that there may be multiple ways of partitioning the sequence s such that each
subsequence matches the corresponding sub-part of the DTD D. In such a case, the above
procedure can be extended to enumerate every decomposition of s that matches sub-parts of D,
and then select from among the decompositions the one that results in the minimum length
encoding of s in terms of D.
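The recursive definition of seq(D, s), together with the choice of the shortest decomposition, can be sketched in Python. This is an illustration, not the automaton-based technique mentioned above: it uses plain backtracking, which can take exponential time. The DTD is assumed to be given as a nested tuple tree with node kinds "sym", "cat", "or" and "star" (a representation introduced here for the example). For a repetition count k the sketch follows the worked example, which writes 4 as 111 0 100 and 3 as 11 0 11, that is, it uses the number of binary digits of k as the width.

```python
def rep_bits(k):
    """Unary width, a 0 delimiter, then k in binary (4 -> 1110100, 3 -> 11011)."""
    width = k.bit_length()
    return "1" * width + "0" + (format(k, "b") if k else "")


def choice_bits(i, m):
    """Index i among m alternatives, using ceil(log2 m) bits."""
    width = (m - 1).bit_length()
    return format(i, "b").zfill(width) if width else ""


def matches(D, s, i):
    """Yield (j, bits) for every way D can consume s[i:j]."""
    kind = D[0]
    if kind == "sym":                                   # step (A)
        if i < len(s) and s[i] == D[1]:
            yield i + 1, ""
    elif kind == "cat":                                 # step (B)
        def walk(parts, pos, bits):
            if not parts:
                yield pos, bits
                return
            for j, b in matches(parts[0], s, pos):
                yield from walk(parts[1:], j, bits + b)
        yield from walk(D[1], i, "")
    elif kind == "or":                                  # step (C)
        for idx, alt in enumerate(D[1]):
            for j, b in matches(alt, s, i):
                yield j, choice_bits(idx, len(D[1])) + b
    elif kind == "star":                                # step (D)
        def repeat(pos, k, bits):
            yield pos, rep_bits(k) + bits
            for j, b in matches(D[1], s, pos):
                if j > pos:                             # avoid empty iterations
                    yield from repeat(j, k + 1, bits + b)
        yield from repeat(i, 0, "")


def seq_bits(D, s):
    """Shortest bit encoding of s in terms of D, or None if s does not match."""
    full = [bits for j, bits in matches(D, s, 0) if j == len(s)]
    return min(full, key=len) if full else None


def sym(c):
    return ("sym", c)


dtd = ("cat", [("star", ("or", [("cat", [sym("a"), sym("b")]), sym("c")])),
               ("or", [("cat", [sym("d"), sym("e")]),
                       ("cat", [sym("f"), ("star", sym("g"))])])])
print(seq_bits(dtd, "abccabfggg"))
```

For the DTD (ab|c)* (de|fg*) and the sequence abccabfggg, the sketch produces the 17-bit string obtained by concatenating the groups 1110100 0 1 1 0 1 11011 shown above.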

Computing the DTD with Minimum MDL Cost 

Next, the final DTD D (which is a logic OR of a subset S of candidate DTDs in Sf ) 
that encompasses all the input sequences and whose MDL cost for encoding the input 
sequences is minimum is computed. The minimization problem maps naturally to the Facility 
Location Problem (FLP). The Facility Location Problem is well known in the art. The FLP is 
formulated as follows: Let C be a set of customers and J be a set of facilities such that each
facility can "serve" every customer. There is a cost c(j) of "choosing" a facility j ∈ J and a cost
d(j, i) of serving customer i ∈ C by facility j ∈ J. The problem definition asks to choose a
subset of facilities F ⊆ J such that the sum of the costs of the chosen facilities plus the sum of
the costs of serving every customer by its closest chosen facility is minimized, that is,

min over F ⊆ J of ( Σ_{j ∈ F} c(j) + Σ_{i ∈ C} min_{j ∈ F} d(j, i) )

The problem of inferring the minimum MDL cost DTD can be reduced to the FLP as
follows: Let C be the set of input sequences and J be the set of candidate DTDs in S_F. The cost
of choosing a facility is the length of the corresponding candidate DTD. The cost of serving
client i from facility j, d(j, i), is the length of the encoding of the sequence corresponding to i
using the DTD corresponding to the facility j. If a DTD j does not encompass a sequence i,
then we set d(j, i) to ∞. Thus, the set F computed by the FLP corresponds to the desired set S
of candidate DTDs. Algorithms for solving the FLP are well known in the art. In a preferred
embodiment, a randomized algorithm is employed to approximate the FLP.
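To make the reduction concrete, the following Python sketch evaluates the FLP objective by exhaustive search over subsets of candidates. This is not the randomized approximation algorithm referred to above; brute force is used here only because it is easy to verify on a tiny instance, and it is exponential in the number of candidates. The function name and the dictionary layout are hypothetical. The costs in the example use 3 bits per DTD character (a two-symbol alphabet plus six metacharacters) and the repetition encoding of the worked example.

```python
import math
from itertools import combinations


def select_dtds(dtd_bits, encoding_bits, sequences):
    """Return (total cost, chosen DTDs) minimizing facility plus service cost.

    dtd_bits[j]          cost of choosing candidate DTD j
    encoding_bits[j][i]  cost of encoding sequence i with DTD j (absent = infinite)
    """
    best = (math.inf, ())
    names = sorted(dtd_bits)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            cost = sum(dtd_bits[j] for j in chosen)
            cost += sum(min(encoding_bits[j].get(i, math.inf) for j in chosen)
                        for i in sequences)
            if cost < best[0]:
                best = (cost, chosen)
    return best


sequences = ["ab", "abab", "ababab"]
dtd_bits = {"ab": 6, "abab": 12, "ababab": 18, "(ab)*": 15}
encoding_bits = {
    "ab": {"ab": 0},
    "abab": {"abab": 0},
    "ababab": {"ababab": 0},
    "(ab)*": {"ab": 3, "abab": 5, "ababab": 5},
}
print(select_dtds(dtd_bits, encoding_bits, sequences))
```

On this instance the general candidate (ab)* alone costs 15 + 13 = 28 bits, while listing the three sequences verbatim costs 36 bits, so the general candidate is selected.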

Having thus described a few particular embodiments of the invention, various 
alterations, modifications, and improvements will readily occur to those skilled in the art. For 
example, the invention may be embodied in computer program instructions stored in a 
computer-readable medium, e.g., floppy disc, hard drive, CD ROM, DVD, ROM, RAM, 
punch card, magnetic tape, etc. Such alterations, modifications and improvements as are 
made obvious by this disclosure are intended to be part of this description though not 
expressly stated herein, and are intended to be within the spirit and scope of the invention. 
Accordingly, the foregoing description is by way of example only, and not limiting. The 
invention is limited only as defined in the following claims and equivalents thereto. 


EXPERIMENTAL RESULTS 

In order to determine the effectiveness of the present invention for inferring the DTD
of a database of XML documents, we conducted a study with both synthetic and real-life
DTDs. We also compared the DTDs produced by a DTD extraction tool (XTRACT) in
accordance with a preferred embodiment of the present invention with those generated by the
IBM alphaworks DTD extraction tool, DDbE (Data Description by Example). The results
indicate that XTRACT outperforms DDbE over a wide range of DTDs, and accurately finds
almost every original DTD while DDbE fails to do so for most DTDs. Thus, the results
clearly demonstrate the effectiveness of XTRACT's approach that employs generalization and
factorization to derive a range of general and concise candidate DTDs, and then uses the
MDL principle as the basis to select from amongst them.

The two DTD extraction algorithms considered in the experimental study are as 
follows: 

XTRACT: XTRACT includes all three steps for determining a DTD in 
accordance with the present invention. In the generalization step, we discover 
both sequencing and OR patterns using procedure GENERALIZE. In the 
factoring step, k = ^/\o subsets are chosen for factoring and the parameter q is 
set to 0 in the procedure FACTORSUBSETS. Finally, in the selection step, 
we employ an algorithm which incorporates MDL principles to compute an
approximation to the facility location problem (FLP).

DDbE: We used Version 1.0 of the DDbE DTD extraction tool in the 
experiments. DDbE is a Java component library for inferring a DTD from a 
data set consisting of well-formed XML instances. DDbE offers parameters 
which permit the user to control the structure of the content models and the 
types used for attribute declarations. Some of the important parameters of 
DDbE that we used in the experiments, along with their default values, are 
presented in Table 5. 



Parameter   Meaning                                                      Default

c           Maximum number of consecutive identical tokens not           1
            replaced by a list

d           Maximum depth of factorization                               2

Table 5: Description of Parameters Used by DDbE 



The parameter c specifies the maximum number of consecutive identical tokens that 
should not be replaced by a list. For example, the default value of this parameter is 1 and thus
all sequences containing two or more repetitions of the same symbol are replaced with a
positive list. That is, aa is substituted by a+. The parameter d determines the number of
applications of factoring. For a set of input sequences that conform to the DTD
a(b|c|d)(e|f|g)h, for increasing values of the parameter d, DDbE returns the DTDs in Table
6.



Parameter Value (d)   DTD Obtained

1                     (acg|ace|adf|abg|abe|acf|adg|ade|abf)h

2                     a(cg|ce|df|bg|be|cf|dg|de|bf)h

3                     a((c|b|d)g|(d|c|b)f|(c|b|d)e)h

4                     a((c|b|d)g|(d|c|b)f|(c|b|d)e)h

Table 6: DTDs generated by DDbE for Increasing Values of Parameter d

As shown in Table 6, for d = 1, factorization is performed once, in which the rightmost
symbol h is factored out. When the value of d becomes 2, the leftmost symbol a is also
factored out. A further increase in the value of d to 3 causes factorization to be performed on
the middle portion of the expression and the common expression (b|c|d) to be extracted.
However, note that subsequent increases in the value of d (beyond 3) do not result in further
changes to the DTD. This seems to be a limitation of DDbE's factoring algorithm since,
examining the DTD for d = 3, we can easily notice that e, f and g have a common factor of
(b|c|d) with different placement of the symbols within the parentheses. However, the current
version of DDbE cannot factorize this further.





In order to evaluate the quality of DTDs retrieved by XTRACT, we used both 
synthetic as well as real-life DTD schemas. For each DTD for a single element, we generated 
an XML file containing 1000 instantiations of the element. These 1000 instantiations were 
generated by randomly sampling from the DTD for the element. Thus, the initial set of input 
sequences I to both XTRACT and DDbE contained somewhere between 500 and 1000 
sequences (after the elimination of duplicates) conforming to the original DTD. 
THE DATA 


Synthetic DTD Data Set : We used a synthetic data generator to generate the synthetic
data sets. Each DTD is randomly chosen to have one of the following two forms:
A1|A2|A3|...|An and A1 A2 A3 ... An. Thus, a DTD has n building blocks, where n is a randomly
chosen number between 1 and mb, where mb is an input parameter to the generator that
specifies the maximum number of building blocks in a DTD. Each building block Ai further
consists of ni symbols, where ni is randomly chosen to be between 1 and ms (the parameter
ms specifies the maximum number of symbols that can be contained in a building block).
Each building block Ai has one of the following four forms, each of which has an equal
probability of occurrence: (1) (a1|a2|...|a_ni), (2) a1a2...a_ni, (3) (a1|a2|...|a_ni)*, (4)
(a1a2...a_ni)*. Here, the ai's denote subelement symbols. Thus, our synthetic data
generator essentially generates DTDs containing one level of nesting of regular expression
terms.

In Table 7, we show the synthetic DTDs that we considered in the experiments (note
that, in Table 7, we only include the regular expression corresponding to the DTD). The
DTDs were produced using the generator with the input parameters mb and ms both set to 5.
Note that we use letters from the alphabet as subelement symbols.
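A generator of this kind can be sketched in Python as follows. The sketch is not the generator used in the experiments; it assumes that the two top-level forms are equally likely, that each building block draws fresh letters in alphabetical order, and that the caller supplies a random.Random instance for reproducibility. The function names are hypothetical.

```python
import random
import string


def random_block(rng, ms, symbols):
    """One building block with between 1 and ms symbols, in one of four forms."""
    n_i = rng.randint(1, ms)
    syms = [next(symbols) for _ in range(n_i)]
    form = rng.randint(1, 4)                 # the four forms are equally likely
    if form == 1:
        return "(" + "|".join(syms) + ")"
    if form == 2:
        return "".join(syms)
    if form == 3:
        return "(" + "|".join(syms) + ")*"
    return "(" + "".join(syms) + ")*"


def random_dtd(rng, mb, ms):
    """A DTD with between 1 and mb blocks, either OR'd together or concatenated."""
    if mb * ms > len(string.ascii_letters):
        raise ValueError("not enough distinct letters for these parameters")
    symbols = iter(string.ascii_letters)
    n = rng.randint(1, mb)
    blocks = [random_block(rng, ms, symbols) for _ in range(n)]
    separator = rng.choice(["|", ""])        # assumed: both forms equally likely
    return separator.join(blocks)


print(random_dtd(random.Random(0), mb=5, ms=5))
```

Instantiations of an element could then be produced by sampling strings from the generated regular expression, which is outside the scope of this sketch.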



No.   Original DTD

1     abcde|efgh|ij|klm

2     (a|b|c|d|f)* gh

3     (a|b|c|d)* |e

4     (abcde)* f

5     (ab)* |cdef|(ghi)*

6     abcdef(g|h|i|j)(k|l|m|n|o)

7     (a|b|c)d* e* (fgh)*

8     (a|b)(cdefg)* hijklmnopq(r|s)*

9     (abcd)* |(e|f|g)* |h|(ijklm)*

10    a* |(b|c|d|e|f)* |gh|(i|j|k)* |(lmn)*

Table 7: Synthetic DTD Data Set 

The ten synthetic DTDs vary in complexity with later DTDs being more complex than 
the earlier ones. For instance, DTD 1 does not contain any metacharacters, while DTDs 2
through 5 contain simple sequencing and OR patterns. DTD 6 represents a DTD in factored
form while in DTDs 7 through 10, factors are combined with sequencing and OR patterns.

Real-life DTD Data Set : We obtained the real-life DTDs from the Newspaper
Association of America (NAA) Classified Advertising Standards XML DTD produced by the
NAA Classified Advertising Standards Task Force. We examined this real-life DTD data and
collected six representative DTDs that are shown in Table 8. Of the DTDs shown in the table,
the last three DTDs are quite interesting. DTD 4 contains the metacharacter ? in conjunction
with the metacharacter *, while DTDs 5 and 6 contain two regular expressions with *s, one
nested within the other.






No.   Original DTD                                                        Simplified DTD

1     <!ENTITY % included-elements                                        a|b|c|d|e
      "audio-clip | blind-box-reply | graphic | linkpi-char | video-clip">

2     <!ELEMENT communications-contacts                                   (a|b|c|d|e)*
      (phone | fax | email | pager | web-page)*>

3     <!ELEMENT employment-services (employment-service.type,             ab* c*
      employment-service.location*, (e.zz-generic-tag)*)>

4     <!ENTITY % location "addr*, geographic-area?, city?,                a* b?c?d?
      state-province?, postal-code?, country?">

5     <!ELEMENT transfer-info (transfer-number, (from-to,                 (a(bc)+d)*
      company-id)+, contact-info)*>

6     <!ELEMENT real-estate-services (real-estate-service.type,           (ab?c* d?)*
      real-estate-service.location?, r-e.response-modes*,
      r-e.comment?)*>






Table 8: Real-life DTD Data Set 



QUALITY OF INFERRED DTDS

Synthetic DTD Data Set: The DTDs inferred by XTRACT and DDbE for the synthetic 
data set are presented in Table 9. As shown in the table, XTRACT infers each of the original 
DTDs correctly. In contrast, DDbE computes the accurate DTD only for DTD 1, which is the 
simplest DTD and contains no metacharacters. Even for the simple DTDs 2-5, DDbE not only 
fails to deduce the original DTD, but also infers a DTD that does not encompass the set of 
input sequences. For instance, one of the input sequences encompassed by DTD 2 is gh, 
which is not encompassed by the DTD inferred by DDbE. Thus, while XTRACT infers a DTD 
that encompasses all the input sequences, the DTD returned by DDbE may not encompass 
every input sequence. DTD 4 exemplifies the two typical behaviors of DDbE: (1) the 
sequence f, which is not frequently repeated, is appended to both the front and the back of 
the final DTD, and (2) symbols that are repeated frequently are all OR'd together and 
encapsulated by the metacharacter +. For example, DDbE incorrectly identifies the term 
(abcde)* to be (a|b|c|d|e)*, which is much more general. Thus, the DDbE tool has a 
tendency to over-generalize when the original DTDs contain regular expressions with *s. 
This same tendency to over-generalize can also be seen in DTDs 8-10. On the other hand, as 
is evident from Table 9, this is not the case for XTRACT, which correctly infers every one of 
the original DTDs, even the more complex DTDs 8-10 that contain various combinations of 
sequencing and OR patterns. This clearly demonstrates the effectiveness of the generalization 
module in discovering these patterns and of the MDL module in selecting these general 
candidate DTDs as the final DTDs. 
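
Whether a DTD encompasses an input sequence can be checked mechanically. The following is a minimal sketch, assuming each symbol is a single letter so that a simplified DTD from Table 9 can be read directly as a Python regular expression. The helper names encompasses and uncovered and the sample sequences are illustrative assumptions and are not part of XTRACT or DDbE.

```python
import re

# Simplified DTDs taken from row 2 of Table 9.
ORIGINAL_DTD_2 = r"(a|b|c|d|f)*gh"   # original DTD (also inferred by XTRACT)
DDBE_DTD_2 = r"gh(a|b|c|d|f)+gh"     # DTD inferred by DDbE


def encompasses(dtd: str, sequence: str) -> bool:
    """Return True if the whole sequence is matched by the DTD expression."""
    return re.fullmatch(dtd, sequence) is not None


def uncovered(dtd: str, sequences: list[str]) -> list[str]:
    """Return the input sequences that the DTD fails to encompass."""
    return [s for s in sequences if not encompasses(dtd, s)]


# Hypothetical input sequences, each generated by the original DTD 2.
samples = ["gh", "agh", "abfgh", "ddcgh"]

print(uncovered(ORIGINAL_DTD_2, samples))  # []
print(uncovered(DDBE_DTD_2, samples))      # ['gh', 'agh', 'abfgh', 'ddcgh']

# Over-generalization: the OR'd form accepts orderings the original rejects.
print(encompasses(r"(a|b|c|d|e)*", "edcba"))  # True
print(encompasses(r"(abcde)*", "edcba"))      # False
```

The first check illustrates a DTD that is too narrow (it misses gh), and the second a DTD that is too broad (it accepts sequences the original DTD never generates).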

Also, as discussed earlier, DDbE is not very good at factoring DTDs. For instance, 
unlike XTRACT, DDbE is unable to derive the final factored form for DTD 6. Finally, 
DDbE infers an extremely complex DTD for the simple DTD 7. The results for the synthetic 
data set clearly demonstrate the superiority of XTRACT's approach (based on the 
combination of generalizing, factoring, and selecting using MDL principles) compared to 
DDbE's for the problem of inferring DTDs. 
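
The benefit of factoring for DTD 6 can be illustrated with a bounded check. The sketch below assumes single-letter symbols and compares the factored form from Table 9 with an unfactored expression of the same shape as the one attributed to DDbE (one alternative per leading symbol). The candidates helper and its bounds are assumptions chosen for illustration; the check covers only the candidate strings it enumerates and is not a proof of equivalence.

```python
import re
from itertools import product

FACTORED = r"abcdef(g|h|i|j)(k|l|m|n|o)"
# Unfactored shape: the second OR group is repeated under every leading symbol.
UNFACTORED = "abcdef(" + "|".join(f"{x}(k|l|m|n|o)" for x in "ghij") + ")"


def candidates(prefix="abcdef", tail_alphabet="ghijklmno", max_tail=3):
    """Yield the prefix followed by every tail of length 0..max_tail."""
    for n in range(max_tail + 1):
        for tail in product(tail_alphabet, repeat=n):
            yield prefix + "".join(tail)


accepted_factored = {s for s in candidates() if re.fullmatch(FACTORED, s)}
accepted_unfactored = {s for s in candidates() if re.fullmatch(UNFACTORED, s)}

print(accepted_factored == accepted_unfactored)  # True
print(len(accepted_factored))                    # 20
print(len(FACTORED) < len(UNFACTORED))           # True
```

Both expressions accept the same 20 sequences among the candidates tried, but the factored form is shorter, which is what an MDL-style selection step would favor.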

Real-life DTD Data Set: The DTDs generated by the two algorithms for the real-life 
data set are shown in Table 10. Of the five DTDs, XTRACT is able to infer all five correctly. 
In contrast, DDbE is able to derive accurate DTDs only for DTDs 1 and 2, and an 
approximate DTD for DTD 3. With an additional factoring step, DDbE could obtain the 
original DTD for DTD 3. Note, however, that DDbE is unable to infer the simple DTD 4, 
which contains the metacharacter ?. In contrast, XTRACT is able to deduce this DTD 
because its factorization step takes into account the identity element "1" and simplifies 
expressions of the form 1|a to a?. DTD 5 represents an interesting case in which XTRACT is 
able to mine a DTD whose regular expressions contain nested *s. This is due to the 
generalization module, which iteratively looks for sequencing patterns. On the other hand, 
DDbE simply over-generalizes DTD 5 by ORing all the symbols in it and enclosing them 
within the metacharacter +. 
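
The simplification of 1|a to a? mentioned above can be sketched as follows. This is an illustrative helper only, assuming an OR expression is given as a list of alternatives in which the string "1" denotes the identity element (the empty sequence). The name simplify_identity is hypothetical and is not taken from XTRACT.

```python
import re

IDENTITY = "1"  # the identity element: stands for the empty sequence


def simplify_identity(alternatives):
    """Rewrite an OR list containing the identity, e.g. 1|a becomes a?."""
    rest = [alt for alt in alternatives if alt != IDENTITY]
    if len(rest) == len(alternatives):
        return "|".join(rest)          # no identity present: nothing to do
    if not rest:
        return ""                      # only the identity: the empty sequence
    if len(rest) == 1 and len(rest[0]) == 1:
        return rest[0] + "?"           # a single symbol needs no parentheses
    return "(" + "|".join(rest) + ")?"


print(simplify_identity(["1", "a"]))       # a?
print(simplify_identity(["1", "bc"]))      # (bc)?
print(simplify_identity(["1", "a", "b"]))  # (a|b)?

# Python's re module has no "1" symbol, so an empty alternative plays the
# role of the identity when confirming that both spellings agree.
for s in ["", "a", "aa"]:
    assert bool(re.fullmatch(r"(|a)", s)) == bool(re.fullmatch(r"a?", s))
```

Applied to the optional elements of DTD 4, a rewrite of this kind is what allows a result such as a*b?c?d? instead of a long enumeration of alternatives.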



No.  Original DTD                        DTD Inferred by XTRACT              DTD Inferred by DDbE
1    abcde|efgh|ij|klm                   abcde|efgh|ij|klm                   abcde|efgh|ij|klm
2    (a|b|c|d|f)*gh                      (a|b|c|d|f)*gh                      gh(a|b|c|d|f)+gh
3    (a|b|c|d)*|e                        (a|b|c|d)*|e                        (e(a|c|d|b)+e)
4    (abcde)*f                           (abcde)*f                           (f(a|e|d|c|b)+f)
5    (ab)*|cdef|(ghi)*                   (ab)*|cdef|(ghi)*                   cdef(a|b|g|i|h)+cdef
6    abcdef(g|h|i|j)(k|l|m|n|o)          abcdef(g|h|i|j)(k|l|m|n|o)          abcdef(i(o|l|m|n|k)|g(o|l|n|m|k)|
                                                                             h(m|l|n|k|o)|j(o|l|n|m|k))
7    (a|b|c)d*e*(fgh)*                   (a|b|c)d*e*(fgh)*                   ((e|b|a)d+e+|ad+|bd+|e(e+t d+)?|ad*|be*))
                                                                             (fih|g)-K(a|b|c)d+e+|c(e+|d+)?|a(e+|d
8    (a|b)(cdefg)*hijklmnopq(rs)*        (a|b)(cdefg)*hijklmnopq(rs)*        ((((a|b)hijabcdef EtlblaYclelf|e|d|s|r)+
                                                                             ((b|a)?hijkamnopq))
9    (abcd)*|(e|f|g)*|h|(ijklm)*         (abcd)*|(ijklm)*|h|(e|f|g)*         h(a|d|c|b|e|g|f|i|m|l|k|j)+h
10   a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*  a*|(b|c|d|e|f)*|gh|(i|j|k)*|(lmn)*  (a+|gh)(e|f|d|i|j|l|n|m|k|c|b)+(a+|gh)



Table 9: DTDs generated by XTRACT and DDbE for Synthetic Data Set 



No.  Simplified DTD  DTD Obtained by XTRACT  DTD Obtained by DDbE
1    a|b|c|d|e       a|b|c|d|e               a|b|c|d|e
2    (a|b|c|d|e)*    (a|b|c|d|e)*            (a|b|c|d|e)*
3    (ab*c*)         ab*c*                   (ab+c*)|(ac*)
4    a*b?c?d?        a*b?c?d?                (a+b(c|(c?d))?)|((b|a+)?cd)|((a+|b)?d)|((a+|b)?c)|a+|b)
5    (a(bc)+d)*      (a(bc)*d)*              (a|b|c|d)+



Table 10: DTDs generated by XTRACT and DDbE for Real-life Data Set 






The quality of the DTDs inferred by XTRACT was compared with those returned by 
the IBM alphaWorks DDbE (Data Descriptors by Example) DTD extraction tool on synthetic 
as well as real-life DTDs. In the experiments, XTRACT outperformed DDbE by a wide 
margin, and for most DTDs it was able to accurately infer the DTD while DDbE completely 
failed to do so. A number of the DTDs that were correctly identified by XTRACT were 
fairly complex and contained factors, metacharacters, and nested regular expression terms. 
Thus, the results clearly demonstrate the effectiveness of XTRACT's approach, which employs 
generalization and factorization to derive a range of general and concise candidate DTDs, and 
then uses a selection step, preferably comprising minimum descriptor length (MDL) principles, 
as the basis to select from amongst them. 
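
The MDL-based selection summarized above weighs the bits needed to describe a candidate DTD against the bits needed to encode the input sequences with it. The following is a minimal sketch of encoding rules of the form seq(D, s) recited in the claims that follow. The class names (Sym, Concat, Or, Star), the fixed width COUNT_BITS for a repetition count, and the per-character DTD description cost BITS_PER_CHAR are assumptions made for illustration and are not values stated in this document.

```python
from dataclasses import dataclass
from functools import lru_cache
from math import inf, log2


@dataclass(frozen=True)
class Sym:            # a run of symbols with no metacharacters
    text: str


@dataclass(frozen=True)
class Concat:         # D1 ... Dk
    parts: tuple


@dataclass(frozen=True)
class Or:             # D1 | ... | Dm
    options: tuple


@dataclass(frozen=True)
class Star:           # D*
    body: object


COUNT_BITS = 4     # assumption: fixed-width field for the repetition count k
BITS_PER_CHAR = 4  # assumption: cost of one character of the DTD text


@lru_cache(maxsize=None)
def seq_bits(d, s):
    """Fewest bits needed to encode sequence s given DTD d (inf if no match)."""
    if isinstance(d, Sym):
        return 0 if d.text == s else inf          # seq(D, s) is empty if D = s
    if isinstance(d, Or):                         # index i costs log m bits
        return log2(len(d.options)) + min(seq_bits(o, s) for o in d.options)
    if isinstance(d, Concat):
        return _split(d.parts, s)
    if isinstance(d, Star):                       # count k, then the k pieces
        return COUNT_BITS + _repeat(d.body, s)
    raise TypeError(f"unknown DTD node: {d!r}")


@lru_cache(maxsize=None)
def _split(parts, s):
    """Cheapest way to divide s among the concatenated parts, in order."""
    if not parts:
        return 0 if s == "" else inf
    return min(seq_bits(parts[0], s[:i]) + _split(parts[1:], s[i:])
               for i in range(len(s) + 1))


@lru_cache(maxsize=None)
def _repeat(body, s):
    """Cheapest way to cut s into non-empty pieces that each match body."""
    if s == "":
        return 0
    return min(seq_bits(body, s[:i]) + _repeat(body, s[i:])
               for i in range(1, len(s) + 1))


def mdl_cost(d, dtd_text, sequences):
    """DTD description bits plus bits for encoding every input sequence."""
    return BITS_PER_CHAR * len(dtd_text) + sum(seq_bits(d, s) for s in sequences)


sequences = ["abgh", "gh", "fgh"]                   # hypothetical inputs
GENERAL = Concat((Star(Or(tuple(Sym(c) for c in "abcdf"))), Sym("gh")))
LITERAL = Or(tuple(Sym(s) for s in sequences))      # abgh|gh|fgh

print(seq_bits(GENERAL, "gh"))   # 4
print(seq_bits(GENERAL, "ba"))   # inf
print(mdl_cost(GENERAL, "(a|b|c|d|f)*gh", sequences))
print(mdl_cost(LITERAL, "abgh|gh|fgh", sequences))
```

Under these assumed costs, the literal enumeration is cheaper for three short sequences. As the number of distinct input sequences grows, the description of an enumerated DTD grows with it, whereas the description of the general candidate does not.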
What is claimed is: 

1. A document descriptor extraction method comprising the steps of: 
generalizing input sequences associated with a document to develop general 
sequences, said input sequences reflecting the structure of a document; 
factoring said input sequences and said general sequences to develop factored 
sequences; 
selecting a document descriptor from said input sequences, said general sequences, 
and said factored sequences using minimum descriptor length (MDL) principles. 

2. The method of claim 1, wherein said selecting step comprises the steps of: 
encoding said input sequences, said general sequences, and said factored sequences; 
and 
selecting a document descriptor which encompasses all of said input sequences and 
exhibits a minimum MDL cost. 

3. The method of claim 2, wherein said encoding step employs an algorithm which 
applies a set of rules comprising: 

seq(D, s) = ε if D = s, if D does not contain metacharacters; 
seq(D_1...D_k, s_1...s_k) = seq(D_1, s_1)...seq(D_k, s_k); 
seq(D_1|...|D_m, s) = i seq(D_i, s); 
seq(D*, s_1...s_k) = {k seq(D, s_1)...seq(D, s_k) if k > 0; 0 otherwise}; 

wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular 
expression that the corresponding sequence s matches, wherein log m bits are needed to 
encode index i. 

4. The method of claim 3, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), said FLP modified to 
compute said minimum MDL cost of potential document descriptors. 

5. The method of claim 4, wherein said document descriptor is a document type 
descriptor (DTD), and said document is an Extensible Markup Language (XML) document. 

6. The method of claim 5, wherein said minimum MDL cost comprises summing a 
first length of bits describing the DTD and a second length of bits for encoding the sequences. 

7. A document descriptor extraction method comprising the steps of: 
generalizing input sequences to develop general sequences, said input sequences 
reflecting the structure of data within a document; 
selecting a document descriptor from said input sequences and said general sequences 
using minimum descriptor length (MDL) principles. 

8. The method of claim 7, wherein said selecting step comprises the steps of: 
encoding said input sequences and said general sequences; and 
selecting a document descriptor which encompasses all of said input sequences and 
exhibits a minimum MDL cost. 

9. The method of claim 8, wherein said encoding step employs an algorithm which 
applies a set of rules comprising: 

seq(D, s) = ε if D = s, if D does not contain metacharacters; 
seq(D_1...D_k, s_1...s_k) = seq(D_1, s_1)...seq(D_k, s_k), if D is a concatenation of D_1...D_k; 
seq(D_1|...|D_m, s) = i seq(D_i, s); 
seq(D*, s_1...s_k) = {k seq(D, s_1)...seq(D, s_k) if k > 0; 0 otherwise}; 

wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular 
expression that the corresponding sequence s matches, wherein log m bits are needed to 
encode index i. 

10. The method of claim 9, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), wherein said FLP is 
modified to compute said minimum MDL cost of potential document descriptors. 

11. The method of claim 10, wherein said document descriptor is a document type 
descriptor (DTD), and said document is an Extensible Markup Language (XML) document. 

12. The method of claim 11, wherein said minimum MDL cost comprises summing a 
first length of bits describing the DTD and a second length of bits for encoding the sequences. 

13. The method of claim 7, further comprising the step of: 
factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available for said step of selecting. 



14. A computer-readable medium encoded with a computer program for generalizing 
input sequences to develop general sequences, said computer program comprising: 
a discover OR patterns procedure; 
a discover sequence patterns procedure; and 
a generalize procedure which calls said discover sequence patterns procedure and calls 
said discover OR patterns procedure, wherein said discover OR patterns procedure is nested 
within said discover sequence patterns procedure. 



15. The computer-readable medium of claim 14, said computer program further 
comprising a partition procedure called by said discover OR patterns procedure. 

16. A method for generalizing input sequences to develop general sequences 
comprising the steps of: 
discovering OR patterns among said input sequences; and 
discovering sequence patterns among said input sequences and OR patterns. 

17. The method of claim 16, wherein said step of discovering OR patterns comprises 
the step of partitioning said input sequences. 



18. A document descriptor extraction method comprising the steps of: 
generalizing input sequences, said generalizing step comprising the steps of: 
discovering OR patterns among said input sequences, and 
discovering sequence patterns among said input sequences and OR patterns; 
selecting a document descriptor from said input sequences and said general sequences. 



19. The method of claim 18, wherein said discovering OR patterns step comprises the 
step of partitioning said input sequences. 

20. The method of claim 19, further comprising the steps of: 
factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available to said step of selecting. 

21. The method of claim 20, wherein said step of selecting employs minimum 
descriptor length (MDL) principles. 

22. The method of claim 21, wherein said document descriptor is a document type 
descriptor (DTD) and said document is an Extensible Markup Language (XML) document. 

TITLE: DOCUMENT DESCRIPTOR EXTRACTION METHOD 



ABSTRACT OF THE DISCLOSURE 



A document descriptor extraction method extracts a document 
descriptor by generalizing input sequences within a document; factoring the input sequences 
and generalized input sequences; and selecting a document descriptor from the input 
sequences, generalized sequences, and factored sequences, preferably using minimum 
descriptor length (MDL) principles. Novel algorithms are employed to perform the 
generalizing, factoring, and selecting. 